iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file

Preparing v0.5.0 release #62

Closed leoisl closed 1 year ago

leoisl commented 1 year ago

Main changes:

  1. scikit-learn v0.24.2 now requires building from source, and I am not sure why... CI failed with Python 3.8, 3.9, 3.10 and 3.11. It was also failing for the same reason when building the docker image with python:3.10-slim, but somehow worked with python:3.8-slim. I tried several ways to fix this but didn't manage, and this version of scikit-learn is not even that old (April 2021). Given these issues, I thought it was a good time to upgrade to the stable versions >=1.0.0, so this is one of the main changes in this PR: we upgraded numpy to "^1.24.4" and scikit-learn to "^1.3.0". The unit tests then got stuck and we could not run them. Fixing this required upgrading pytest and running it with --forked, but that made the tests too slow. So I've added pytest-xdist = "^3.3.1" to the dependencies, and we now also run pytest with -n 4 (i.e. 4 worker processes), so it is fast again;
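For reference, the dependency bumps above would look roughly like this in a Poetry-style pyproject.toml (a sketch only; the group layout and section names assume the project's existing Poetry configuration):

```toml
[tool.poetry.dependencies]
numpy = "^1.24.4"
scikit-learn = "^1.3.0"

[tool.poetry.group.dev.dependencies]
pytest-xdist = "^3.3.1"  # enables `pytest -n 4` (4 worker processes)
```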
  2. Simplified the force_overwrite tests: 3 of them were redundant;
  3. I selected elkan as the kmeans algorithm in the updated scikit-learn version, as tests were failing with the other algorithm (lloyd). Weirdly, elkan kmeans changes the PRG/result for a single test case, the amira one. So I am a bit reluctant, and wondering if this should be a prerelease until we test it more on real data...

This release properly handles Ns in the MSA and in the denovo sequences. It fixes the breaking issues reported by @Danderson123 in amira.

codecov[bot] commented 1 year ago

Codecov Report

Merging #62 (0faccb1) into dev (7fbbc8c) will increase coverage by 0.00%. The diff coverage is 100.00%.

:exclamation: Current head 0faccb1 differs from pull request most recent head dae2c1d. Consider uploading reports for the commit dae2c1d to get more accurate results

@@           Coverage Diff           @@
##              dev      #62   +/-   ##
=======================================
  Coverage   99.73%   99.73%           
=======================================
  Files          18       18           
  Lines        1525     1528    +3     
=======================================
+ Hits         1521     1524    +3     
  Misses          4        4           
Impacted Files Coverage Δ
make_prg/from_msa/cluster_sequences.py 100.00% <100.00%> (ø)
make_prg/update/denovo_variants.py 100.00% <100.00%> (ø)
make_prg/utils/seq_utils.py 100.00% <100.00%> (ø)
leoisl commented 1 year ago

The options for the kmeans algorithm in the new scikit-learn version are lloyd or elkan. For lloyd, this unit test fails as well as the amira integration test. For elkan, only the amira integration test failed. So I went with elkan, but I don't really know which option is actually the best, nor did I check in detail why lloyd was failing in that unit test... If you think it is better to switch to lloyd (the default), or that I should investigate this in more detail, please tell me

iqbal-lab commented 1 year ago

Elkan vs Lloyd - the choice is about speed (elkan is faster) and ram (elkan uses a bit more, i think). i do not think we need to be perfectly identical, it's ok to switch to elkan. But we still need to understand the amira integration fail, right?
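For context, here is a pure-Python sketch of Lloyd's update loop (a toy `lloyd_kmeans`, not make_prg or scikit-learn code). Elkan's algorithm computes the same assignments but uses triangle-inequality bounds to skip distance computations, so the two should agree in exact arithmetic; differences can only come from floating-point rounding and tie-breaking:

```python
def lloyd_kmeans(points, centroids, iters=10):
    """Toy Lloyd's k-means: alternate assignment and centroid update."""
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # update step: each centroid moves to the mean of its cluster
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(lloyd_kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))  # [(0.0, 0.5), (10.0, 10.5)]
```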

leoisl commented 1 year ago

In the amira integration test, we try to build 3 PRGs: alsB, glpG and group_18516. glpG is built exactly as before, but this is not true for the other two. group_18516 has small differences at 2 sites, which are just a different ordering of alleles and alleles being more nested. One example follows; the other is similar:

<before
7  9 CCCCCGC 10 TCCCCGC 9  8 CCCCCGA 8 TCCCCGA 7 GTGGCGCAGGC
>after
7 CCCCCGC 8 CCCCCGA 8 TCCCCGA 8 TCCCCGC 7 GTGGCGCAGGC
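That the before/after strings are equivalent can be checked mechanically by enumerating what each PRG spells. A pure-Python sketch (a hypothetical helper, not part of make_prg), assuming the marker convention that an odd number opens and closes a site and the following even number separates its alleles:

```python
def expand_prg(prg):
    """Enumerate every sequence a (possibly nested) PRG string can spell."""
    seqs, _ = _expand(prg.split(), 0, None)
    return sorted(seqs)

def _expand(tokens, i, site):
    # `site` is the odd marker of the enclosing site (None at top level)
    alleles = []      # finished alleles of the enclosing site
    current = {""}    # partial sequences of the allele being built
    while i < len(tokens):
        tok = tokens[i]
        i += 1
        if not tok.isdigit():                           # a sequence segment
            current = {c + tok for c in current}
        elif int(tok) % 2 == 1 and int(tok) != site:    # a nested site opens
            sub, i = _expand(tokens, i, int(tok))
            current = {c + s for c in current for s in sub}
        elif int(tok) == site:                          # enclosing site closes
            alleles.append(current)
            return set().union(*alleles), i
        else:                                           # even marker: next allele
            alleles.append(current)
            current = {""}
    alleles.append(current)
    return set().union(*alleles), i

before = "7 9 CCCCCGC 10 TCCCCGC 9 8 CCCCCGA 8 TCCCCGA 7 GTGGCGCAGGC"
after = "7 CCCCCGC 8 CCCCCGA 8 TCCCCGA 8 TCCCCGC 7 GTGGCGCAGGC"
print(expand_prg(before) == expand_prg(after))  # both spell the same 4 sequences
```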

alsB is a bit more complicated. The first difference is that one site has one more allele:

>alsB
<before
 17 C 18 A 17 ACTGGGCGTCAGCGTTGATATTTTTGCCTC
>after
 17 CA 18 CG 18 AA 17 CTGGGCGTCAGCGTTGATATTTTTGCCTC 

If we compare before and after, both sites can spell CACTGGG... and AACTGGG..., but only the second one can spell CGCTGGG..., and I can find CGCTGGG... in the MSAs... so it seems the new version is actually doing the right thing.
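The spelling claim can be verified with plain Python sets (the sequences are copied from the alsB site above):

```python
# alleles of the alsB site, before and after, concatenated with the
# flanking sequence so we can compare what each version can spell
before = {x + "ACTGGGCGTCAGCGTTGATATTTTTGCCTC" for x in ("C", "A")}
after = {x + "CTGGGCGTCAGCGTTGATATTTTTGCCTC" for x in ("CA", "CG", "AA")}

print(before < after)  # every old spelling is kept (True)
print(after - before)  # the one new spelling, starting CGCTGGG...
```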

The second difference is that the second version has two more SNPs in a region than the first one, representing 4 alleles instead of 1:

<before
ATGTTGTCAACCAGCGCTTTTGCTGCCGCCGAATATGC
>after
ATGTTGTCAACCA 27 G 28 A 27 CGCTTTTGCT 29 G 30 A 29 CCGCCGAATATGC

The 1st allele, common to both versions, is the most common in the MSA (genotypes G and G at both SNPs, n=774). Two other alleles, only present in the second version, are also in the MSA but at very low frequency (genotypes G and A, n=2; genotypes A and G, n=1). A fourth allele, genotypes A and A, is not present in the MSA at all, but this is expected, as encoding the region as two independent SNP sites necessarily makes the A/A combination spellable too.
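A quick check that two independent G/A SNPs spell exactly four combinations, including the unobserved A/A one (sequence segments copied from the region above):

```python
from itertools import product

# the two-SNP region of alsB, split at the SNP positions
prefix, mid, suffix = "ATGTTGTCAACCA", "CGCTTTTGCT", "CCGCCGAATATGC"

# two independent G/A SNPs spell all four combinations
spelled = {prefix + a + mid + b + suffix for a, b in product("GA", repeat=2)}
print(len(spelled))  # 4
# the single "before" allele (G at both SNPs) is among them
print("ATGTTGTCAACCAGCGCTTTTGCTGCCGCCGAATATGC" in spelled)  # True
```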

I can continue this analysis if you want, going through the 3rd, 4th, and so on differences in alsB, but the first evidence seems to show the new version is actually producing more correct PRGs than before...

iqbal-lab commented 1 year ago

looks good to me, am ok to merge