poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License
7 stars 2 forks source link

Gzip reading-support #305

Closed stschiff closed 1 month ago

stschiff commented 2 months ago

This PR links to a newer version of sequence-formats, which adds support for reading gzipped genotype files. Here is what I wrote in the Changelog:

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 22.22222% with 7 lines in your changes missing coverage. Please review.

Project coverage is 60.40%. Comparing base (c401b90) to head (4507c06). Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
src/Poseidon/CLI/OptparseApplicativeParsers.hs 33.33% 1 Missing and 3 partials :warning:
src/Poseidon/Package.hs 0.00% 2 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #305 +/- ## ========================================== - Coverage 68.37% 60.40% -7.97% ========================================== Files 26 27 +1 Lines 3554 4031 +477 Branches 403 409 +6 ========================================== + Hits 2430 2435 +5 - Misses 721 1187 +466 - Partials 403 409 +6 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

nevrome commented 2 months ago

Brilliant that this is on the way! We should remember to enable this feature then also in the next schema version. Before I'll get into the proper review and testing I would like to clarify if this change really has to be breaking.

I understand that you can't do IO in OP.Parser, so -p will always only ever know about one file. But why can't we keep the previous logic of allowing any of ".geno", ".snp", ".ind", ".bed", ".bim", ".fam" for the extensions? Instead of takeExtension we could just use takeExtensions to get all extensions of the given file. And then match against these four groups:

  1. ".geno", ".snp", ".ind"
  2. ".bed", ".bim", ".fam"
  3. ".geno.gz", ".snp.gz", ".ind"
  4. ".bed.gz", ".bim.gz", ".fam"

Of course ".ind" and ".fam" would match the wrong groups if the genotype data is zipped. But that is a simple error case already covered in the CLI documentation:

If a gzipped genotype file is given, it is assumed that the corresponding .snp.gz or .bim.gz file is also gzipped (but not the .fam or .ind file)

Was this error case the only reason why we can not allow to give ".snp", ".ind", ".bim" and ".fam" any more? If so, then I'd argue a breaking change weighs heavier than that :thinking:

stschiff commented 2 months ago

Yes, I think I like this idea. Indeed makes it somewhat clearer for the user, and avoids the breaking change. So just to be clear: You would then say that if the user gives a *.fam or a *.ind file, we (somewhat arbitrarily) assume that the other two files are unzipped. Correct? Cause that is not covered by the current CLI documentation. The current CLI documentation goes the other way around: If you give a zipped .geno, .snp, .bed or .bum file, it assumes the other is also zipped. Anyway, I think I like the solution and will adapt the API.

stschiff commented 2 months ago

Oh boy, that was actually a bug! I should have used takeExtensions anyway. takeExtension can never return .geno.gz, so that was actually a bug. Thanks for the unintentional catch!

stschiff commented 2 months ago

OK, I have adapted this now as you suggested. So now it's not a breaking change anymore. The code is also quite clear now:

 readGenoInput p = makeGenoInput (dropExtensions p) (takeExtensions p)
 makeGenoInput path ext
     | ext `elem` ["geno",    "snp",   "ind"] = Right (GenotypeFormatEigenstrat, path <.> "geno",    path <.> "snp",    path <.> "ind")
     | ext `elem` ["geno.gz", "snp.gz"      ] = Right (GenotypeFormatEigenstrat, path <.> "geno.gz", path <.> "snp.gz", path <.> "ind")
     | ext `elem` ["bed",     "bim",   "fam"] = Right (GenotypeFormatPlink,      path <.> "bed",     path <.> "bim",    path <.> "fam")
     | ext `elem` ["bed.gz",  "bim.gz"      ] = Right (GenotypeFormatPlink,      path <.> "bed.gz",  path <.> "bim.gz", path <.> "fam")
     | otherwise                              = Left $ "unknown file extension: " ++ ext
nevrome commented 2 months ago

I added a new test module specifically for CLI interface parsers in 0b6db85, because our golden tests do not cover them (interesting design decision I made back then... but here we are :shrug:).

I think it helped immediately to find and fix a mistake in -p. takeExtensions grabs the leading . before the extensions, and dropExtensions removes it. So it must be given explicitly in the extension strings in readGenoInput.

stschiff commented 1 month ago

Excellent! I also thought that we should test this. Great idea to just start this with one simple interface case.

stschiff commented 1 month ago

OK, just to be clear, are you OK with merging this now?

nevrome commented 1 month ago

Well - the changelog and the command line documentation must be adjusted after the last changes.

May I ask (again?) why you only implemented reading and not also writing? The feature would have a lot more impact if writing was possible as well.

stschiff commented 1 month ago

Gzipped writing support requires additions to sequence-formats, and I prefer to get these features implemented one by one. We don't have to release before everything is done, of course, but I would like to get this PR in first, and then deal with the rest. I have another PR dealing with VCF-reading which is built on top of this one, which is the next step. After that, I will take care of writing.

stschiff commented 1 month ago

I have edited the Changelog. The CLI was already adapted to the new change.

nevrome commented 1 month ago

I tested it finally and it worked as expected. So this can be merged, imho. A major version jump is not necessary as this is not breaking any more.

stschiff commented 1 month ago

Ah right. Cool, I'll adapt the version and merge it into master then. Thanks!