Gzip reading-support

stschiff commented 2 months ago

This PR links to a newer version of sequence-formats, which adds support for reading gzipped genotype files. Here is what I wrote in the Changelog:

Linked to sequence-formats 1.8.1.0, which adds reading support for gzipped Plink (.bed and .bim) and Eigenstrat (.geno and .snp) files.
Gzipped files are recognised automatically by their file ending.
A mild but technically breaking change is the behaviour of init, genoconvert and forge with the -p, --genoOne flag, where we now allow only .bed, .bed.gz, .geno or .geno.gz files to define the trio of genotype files.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 22.22222% with 7 lines in your changes missing coverage. Please review.

Project coverage is 60.40%. Comparing base (c401b90) to head (4507c06). Report is 9 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Poseidon/CLI/OptparseApplicativeParsers.hs | 33.33% | 1 Missing and 3 partials :warning: |
| src/Poseidon/Package.hs | 0.00% | 2 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #305      +/-   ##
==========================================
- Coverage   68.37%   60.40%   -7.97%
==========================================
  Files          26       27       +1
  Lines        3554     4031     +477
  Branches      403      409       +6
==========================================
+ Hits         2430     2435       +5
- Misses        721     1187     +466
- Partials      403      409       +6
```


nevrome commented 2 months ago

Brilliant that this is on the way! We should remember to also enable this feature in the next schema version. Before I get into the proper review and testing, I would like to clarify whether this change really has to be breaking.

I understand that you can't do IO in OP.Parser, so -p will always only ever know about one file. But why can't we keep the previous logic of allowing any of ".geno", ".snp", ".ind", ".bed", ".bim", ".fam" for the extensions? Instead of takeExtension we could just use takeExtensions to get all extensions of the given file. And then match against these four groups:

".geno", ".snp", ".ind"
".bed", ".bim", ".fam"
".geno.gz", ".snp.gz", ".ind"
".bed.gz", ".bim.gz", ".fam"

Of course ".ind" and ".fam" would match the wrong groups if the genotype data is zipped. But that is a simple error case already covered in the CLI documentation:

If a gzipped genotype file is given, it is assumed that the corresponding .snp.gz or .bim.gz file is also gzipped (but not the .fam or .ind file)

Was this error case the only reason why we can no longer allow ".snp", ".ind", ".bim" and ".fam" as input? If so, then I'd argue a breaking change weighs heavier than that :thinking:

stschiff commented 2 months ago

Yes, I think I like this idea. Indeed, it makes things somewhat clearer for the user and avoids the breaking change. So just to be clear: you would then say that if the user gives a *.fam or a *.ind file, we (somewhat arbitrarily) assume that the other two files are unzipped. Correct? Because that is not covered by the current CLI documentation. The current CLI documentation goes the other way around: if you give a zipped .geno, .snp, .bed or .bim file, it assumes the corresponding other file is also zipped. Anyway, I think I like the solution and will adapt the API.

stschiff commented 2 months ago

Oh boy, that was actually a bug! I should have used takeExtensions anyway. takeExtension only returns the last extension (.gz), so it can never return .geno.gz. Thanks for the unintentional catch!
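
The difference between the two functions can be checked directly. The following is a minimal sketch using the `filepath` package; the helper names `extensionViews` and `baseViews` and the path `data/mypac.geno.gz` are made up for illustration and do not come from the PR.

```haskell
import System.FilePath (dropExtension, dropExtensions, takeExtension, takeExtensions)

-- Hypothetical helper: (last extension only, all extensions) of a path.
extensionViews :: FilePath -> (String, String)
extensionViews p = (takeExtension p, takeExtensions p)

-- Hypothetical helper: (path minus last extension, path minus all extensions).
baseViews :: FilePath -> (FilePath, FilePath)
baseViews p = (dropExtension p, dropExtensions p)

-- extensionViews "data/mypac.geno.gz" == (".gz", ".geno.gz")
-- baseViews      "data/mypac.geno.gz" == ("data/mypac.geno", "data/mypac")
```

Loaded into GHCi, `extensionViews "data/mypac.geno.gz"` shows why a comparison against a `.geno.gz` ending can only succeed with `takeExtensions`. Note that `takeExtensions` returns everything from the first dot of the file name onward, so a base name that itself contains dots would yield a longer string than just `.geno.gz`.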

stschiff commented 2 months ago

OK, I have adapted this now as you suggested. So now it's not a breaking change anymore. The code is also quite clear now:

 readGenoInput p = makeGenoInput (dropExtensions p) (takeExtensions p)
 makeGenoInput path ext
     | ext `elem` ["geno",    "snp",   "ind"] = Right (GenotypeFormatEigenstrat, path <.> "geno",    path <.> "snp",    path <.> "ind")
     | ext `elem` ["geno.gz", "snp.gz"      ] = Right (GenotypeFormatEigenstrat, path <.> "geno.gz", path <.> "snp.gz", path <.> "ind")
     | ext `elem` ["bed",     "bim",   "fam"] = Right (GenotypeFormatPlink,      path <.> "bed",     path <.> "bim",    path <.> "fam")
     | ext `elem` ["bed.gz",  "bim.gz"      ] = Right (GenotypeFormatPlink,      path <.> "bed.gz",  path <.> "bim.gz", path <.> "fam")
     | otherwise                              = Left $ "unknown file extension: " ++ ext

nevrome commented 2 months ago

I added a new test module specifically for CLI interface parsers in 0b6db85, because our golden tests do not cover them (interesting design decision I made back then... but here we are :shrug:).

I think it helped immediately to find and fix a mistake in -p. takeExtensions returns the extensions including the leading ., and dropExtensions removes it from the path. So the . must be given explicitly in the extension strings that readGenoInput matches against.
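
For illustration, here is a minimal sketch of what the matching could look like with the leading dots in place. It is not the code from commit 0b6db85: the `GenotypeFormat` type below is a local stand-in for the type used in Poseidon, and the `GenoInput` alias is a hypothetical name introduced only to keep the signatures readable.

```haskell
import System.FilePath (dropExtensions, takeExtensions, (<.>))

-- Stand-in for the format type named in the thread; the real definition
-- lives in the Poseidon code base and may differ.
data GenotypeFormat = GenotypeFormatEigenstrat | GenotypeFormatPlink
  deriving (Eq, Show)

-- Hypothetical alias: (format, genotype file, SNP file, individual file).
type GenoInput = (GenotypeFormat, FilePath, FilePath, FilePath)

readGenoInput :: FilePath -> Either String GenoInput
readGenoInput p = makeGenoInput (dropExtensions p) (takeExtensions p)

-- takeExtensions keeps the leading dot, so the patterns must include it.
-- (<.>) inserts the separating dot itself, so the right-hand sides omit it.
makeGenoInput :: FilePath -> String -> Either String GenoInput
makeGenoInput path ext
  | ext `elem` [".geno", ".snp", ".ind"] =
      Right (GenotypeFormatEigenstrat, path <.> "geno", path <.> "snp", path <.> "ind")
  | ext `elem` [".geno.gz", ".snp.gz"] =
      Right (GenotypeFormatEigenstrat, path <.> "geno.gz", path <.> "snp.gz", path <.> "ind")
  | ext `elem` [".bed", ".bim", ".fam"] =
      Right (GenotypeFormatPlink, path <.> "bed", path <.> "bim", path <.> "fam")
  | ext `elem` [".bed.gz", ".bim.gz"] =
      Right (GenotypeFormatPlink, path <.> "bed.gz", path <.> "bim.gz", path <.> "fam")
  | otherwise = Left $ "unknown file extension: " ++ ext
```

With this sketch, `readGenoInput "data/mypac.snp.gz"` resolves to the gzipped Eigenstrat trio with an uncompressed `.ind` file, while `readGenoInput "data/mypac.fam"` resolves to the uncompressed Plink trio, matching the assumption discussed earlier in the thread.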