Closed stschiff closed 1 month ago
Attention: Patch coverage is 22.22222%
with 7 lines
in your changes missing coverage. Please review.
Project coverage is 60.40%. Comparing base (
c401b90
) to head (4507c06
). Report is 9 commits behind head on master.
Files with missing lines | Patch % | Lines |
---|---|---|
src/Poseidon/CLI/OptparseApplicativeParsers.hs | 33.33% | 1 Missing and 3 partials :warning: |
src/Poseidon/Package.hs | 0.00% | 2 Missing and 1 partial :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Brilliant that this is on the way! We should remember to enable this feature then also in the next schema version. Before I'll get into the proper review and testing I would like to clarify if this change really has to be breaking.
I understand that you can't do IO in OP.Parser
, so -p
will always only ever know about one file. But why can't we keep the previous logic of allowing any of ".geno", ".snp", ".ind", ".bed", ".bim", ".fam" for the extensions? Instead of takeExtension
we could just use takeExtensions
to get all extensions of the given file. And then match against these four groups:
Of course ".ind" and ".fam" would match the wrong groups if the genotype data is zipped. But that is a simple error case already covered in the CLI documentation:
If a gzipped genotype file is given, it is assumed that the corresponding .snp.gz or .bim.gz file is also gzipped (but not the .fam or .ind file)
Was this error case the only reason why we can not allow to give ".snp", ".ind", ".bim" and ".fam" any more? If so, then I'd argue a breaking change weighs heavier than that :thinking:
Yes, I think I like this idea. Indeed makes it somewhat clearer for the user, and avoids the breaking change. So just to be clear: You would then say that if the user gives a *.fam
or a *.ind
file, we (somewhat arbitrarily) assume that the other two files are unzipped. Correct? Cause that is not covered by the current CLI documentation. The current CLI documentation goes the other way around: If you give a zipped .geno
, .snp
, .bed
or .bum
file, it assumes the other is also zipped. Anyway, I think I like the solution and will adapt the API.
Oh boy, that was actually a bug! I should have used takeExtensions
anyway. takeExtension
can never return .geno.gz
, so that was actually a bug. Thanks for the unintentional catch!
OK, I have adapted this now as you suggested. So now it's not a breaking change anymore. The code is also quite clear now:
readGenoInput p = makeGenoInput (dropExtensions p) (takeExtensions p)
makeGenoInput path ext
| ext `elem` ["geno", "snp", "ind"] = Right (GenotypeFormatEigenstrat, path <.> "geno", path <.> "snp", path <.> "ind")
| ext `elem` ["geno.gz", "snp.gz" ] = Right (GenotypeFormatEigenstrat, path <.> "geno.gz", path <.> "snp.gz", path <.> "ind")
| ext `elem` ["bed", "bim", "fam"] = Right (GenotypeFormatPlink, path <.> "bed", path <.> "bim", path <.> "fam")
| ext `elem` ["bed.gz", "bim.gz" ] = Right (GenotypeFormatPlink, path <.> "bed.gz", path <.> "bim.gz", path <.> "fam")
| otherwise = Left $ "unknown file extension: " ++ ext
I added a new test module specifically for CLI interface parsers in 0b6db85, because our golden tests do not cover them (interesting design decision I made back then... but here we are :shrug:).
I think it helped immediately to find and fix a mistake in -p
. takeExtensions
grabs the leading .
before the extensions, and dropExtensions
removes it. So it must be given explicitly in the extension strings in readGenoInput
.
Excellent! I also thought that we should test this. Great idea to just start this with one simple interface case.
OK, just to be clear, are you OK with merging this now?
Well - the changelog and the command line documentation must be adjusted after the last changes.
May I ask (again?) why you only implemented reading and not also writing? The feature would have a lot more impact if writing was possible as well.
Gzipped writing support requires additions to sequence-formats, and I prefer to get these features implemented one by one. We don't have to release before everything is done, of course, but I would like to get this PR in first, and then deal with the rest. I have another PR dealing with VCF-reading which is built on top of this one, which is the next step. After that, I will take care of writing.
I have edited the Changelog. The CLI was already adapted to the new change.
I tested it finally and it worked as expected. So this can be merged, imho. A major version jump is not necessary as this is not breaking any more.
Ah right. Cool, I'll adapt the version and merge it into master then. Thanks!
This PR links to a newer version of sequence-formats, which adds support for reading gzipped genotype files. Here is what I wrote in the Changelog:
-p, --genoOne
flag, where we now allow only.bed
,.bed.gz
,.geno
or.geno.gz
files to define the trio of genotype files.