iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file
Other
21 stars 7 forks source link

Fix 27 #28

Closed bricoletc closed 2 years ago

bricoletc commented 2 years ago

Here are some changes to produce many fewer graphs with sequence ambiguity- when different paths spell the same sequence, causing problems for downstream read mapping and genotyping.

See #27 for the reduction in ambiguous genotype calls in gramtools across 26 genes in 14 samples, cut out 99% of them on this dataset!

My genotyping performance did not improve overall on this dataset (though it fixed the one in #27), so it may be good to try the changes here on another validation dataset. In fact for upcoming changes by @leoisl we need to do that IMO.

If happy with these changes we should probably version bump.

Fixed

So, if you have at least one sequence (e.g. `ATTTTTGA`, `num_unique_nongapped` = 1) occurring multiple times in the MSA but as different alignments (e.g. `AT-TTTTGA` and `ATTTT-TGA`, `num_unique_gapped` = 2), we dont do clustering/recursive prg construction. Reason is that the MSA tool did not find a coherent MSA in these cases, and this opens up the door for creating sub-prgs that spell the same sequences.
In our example:

AT-TTTTGA ATTTT-TGA


the prg builder could collapse the middle `TT` together, then you have `AT` vs `ATT` in the left sub-prg, and `TTGA` vs `TGA` onthe right sub-prg. And lo and behold, `ATTTTTGA` exists in two different paths. 
bricoletc commented 2 years ago

Test coverage report:

$ coverage run -m pytest
$ coverage report --omit="venv*,tests*"
Name                                      Stmts   Miss  Cover
-------------------------------------------------------------
make_prg/__init__.py                          6      2    67%
make_prg/from_msa/__init__.py                 4      0   100%
make_prg/from_msa/cluster_sequences.py      170      1    99%
make_prg/from_msa/interval_partition.py     124      2    98%
make_prg/from_msa/prg_builder.py            132     21    84%
make_prg/io_utils.py                         94     72    23%
make_prg/prg_encoder.py                      50      0   100%
make_prg/seq_utils.py                        55      4    93%
make_prg/subcommands/__init__.py              1      0   100%
make_prg/subcommands/output_type.py          24      0   100%
-------------------------------------------------------------
TOTAL                                       660    102    85%

The missing lines in prg_builder.py are all logging or to do with computing the max nesting level reached, so coverage is decent

bricoletc commented 2 years ago

Great review @mbhall88 thanks, i add some of the proposed changes to this pr and replied to the other comments

bricoletc commented 2 years ago

Thanks for your review @leoisl, i've added your suggestions, please merge this to dev if you're happy with my replies to the still open comments. I think we should then update make_prg to version 0.1.2, before your update PRs making up version 0.2.0

mbhall88 commented 2 years ago

I think we should then update make_prg to version 0.1.2, before your update PRs making up version 0.2.0

In keeping with semver, we need to bump the minor version (0.2.0). See discussion on https://github.com/iqbal-lab-org/make_prg/pull/26

So we either bump to 0.2.0 now, and then 0.3.0 when Leandro merges his stuff. Or we wait and do a massive merge and just bump to 0.2.0

bricoletc commented 2 years ago

Good point @mbhall88 , in that case I recommend we do 0.2.0 here. This will help stick to this version if the update command changes break/change/improve prg construction in any way, which I think remains to be tested.

leoisl commented 2 years ago

I agree with merging this PR and bumping to 0.2.0