poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License
7 stars 2 forks source link

switch from String to Text in the representation of .janno files #212

Closed nevrome closed 1 year ago

nevrome commented 1 year ago

This PR replaces String in .janno files with Text. This has various consequences:

When I went through the code for list and summarize I did some minor refactoring. In list I changed the error management for the -j option. In summarize I simplified the frequency reporting code. The output has now some ugly quotes, but the code is more easy to read. A fair tradeoff, imho.

Finally I also wanted to address the issue that triggered this entire PR: #213. I introduced the new function cleanInput in Janno.hs which trims white spaces before the actual CSV parsing starts. This is an expensive operation, as I decided to decode the raw input ByteStrings to Text, manipulate them and then encode them again, before they are yet again parsed depending on the output type. Removing UTF-8 (!) white spaces in ByteString was not reliable as some of them are sometimes (?) encoded by two characters (e.g. \194\160 for the No-Break Space \160). ByteString's dropWhile and dropWhileEnd only allow to handle single Chars. Data.Text's strip ALSO only trims by single Chars with Data.Char.isSpace, but does it somehow (?) more reliably. I did not have to consider \194 as a separate case with this solution. That's why I accepted the performance cost of decoding and encoding. Feel free to suggest something else. But what I implemented now seems to fix #213, which should be the minimum goal.

codecov-commenter commented 1 year ago

Codecov Report

Base: 64.35% // Head: 64.99% // Increases project coverage by +0.64% :tada:

Coverage data is based on head (a6dbfbf) compared to base (3ee14e9). Patch coverage: 50.87% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #212 +/- ## ========================================== + Coverage 64.35% 64.99% +0.64% ========================================== Files 18 18 Lines 2233 2274 +41 Branches 249 252 +3 ========================================== + Hits 1437 1478 +41 + Misses 547 544 -3 - Partials 249 252 +3 ``` | [Impacted Files](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework) | Coverage Δ | | |---|---|---| | [src/Poseidon/CLI/Forge.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL0NMSS9Gb3JnZS5ocw==) | `59.83% <0.00%> (+0.15%)` | :arrow_up: | | [src/Poseidon/CLI/List.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL0NMSS9MaXN0Lmhz) | `40.49% <16.66%> (-1.31%)` | :arrow_down: | | [src/Poseidon/CLI/Summarise.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL0NMSS9TdW1tYXJpc2UuaHM=) | `70.37% <33.33%> (+4.24%)` | :arrow_up: | | [src/Poseidon/Janno.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL0phbm5vLmhz) | `65.64% <97.50%> (-0.12%)` | :arrow_down: | | [src/Poseidon/Package.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL1BhY2thZ2UuaHM=) | `75.52% <100.00%> (-0.01%)` | :arrow_down: | | [src/Poseidon/CLI/Validate.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL0NMSS9WYWxpZGF0ZS5ocw==) | `52.38% <0.00%> (ø)` | | | [src/Poseidon/SecondaryTypes.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL1NlY29uZGFyeVR5cGVzLmhz) | `50.49% <0.00%> (+0.49%)` | :arrow_up: | | [src/Poseidon/CLI/Survey.hs](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework#diff-c3JjL1Bvc2VpZG9uL0NMSS9TdXJ2ZXkuaHM=) | `90.21% <0.00%> (+1.08%)` | :arrow_up: | | ... and [2 more](https://codecov.io/gh/poseidon-framework/poseidon-hs/pull/212?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework) | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=poseidon-framework)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

nevrome commented 1 year ago

Ok - so @stschiff and I decided to abandon the transformation to Text for now. We had a wrong understand of what a Char actually is in Haskell. I will prepare another PR with the useful changes introduced here, including the fix for #213.

nevrome commented 1 year ago

See PR #220