Closed — leoisl closed this 1 year ago
Merging #62 (0faccb1) into dev (7fbbc8c) will increase coverage by 0.00%. The diff coverage is 100.00%.

:exclamation: Current head 0faccb1 differs from pull request most recent head dae2c1d. Consider uploading reports for the commit dae2c1d to get more accurate results.
```diff
@@           Coverage Diff           @@
##              dev      #62   +/-  ##
=======================================
  Coverage   99.73%   99.73%
=======================================
  Files          18       18
  Lines        1525     1528    +3
=======================================
+ Hits         1521     1524    +3
  Misses          4        4
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| make_prg/from_msa/cluster_sequences.py | 100.00% <100.00%> (ø) | |
| make_prg/update/denovo_variants.py | 100.00% <100.00%> (ø) | |
| make_prg/utils/seq_utils.py | 100.00% <100.00%> (ø) | |
The options for the kmeans algorithm in the new scikit-learn version are `lloyd` or `elkan`. With `lloyd`, this unit test fails, and so does the amira integration test. With `elkan`, just the amira integration test failed. So I went for `elkan`, but I don't really know which option is actually the best, nor did I check in detail why `lloyd` was failing in that unit test... If you think it is better to switch to `lloyd` (the default), or if I should investigate this in more detail, please tell me.
Elkan vs Lloyd: the choice is about speed (elkan is faster) and RAM (elkan uses a bit more, I think). I do not think we need to be perfectly identical, so it is OK to switch to elkan. But we still need to understand the amira integration failure, right?
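For context on why the two can diverge, here is a toy sketch (plain Python on 1-D data, not scikit-learn's implementation) of the Lloyd iteration both algorithms converge to. `elkan` computes the same assignments, just faster via the triangle inequality, so result differences between the two usually come down to floating-point rounding or tie-breaking on degenerate clusters:

```python
def lloyd_step(points, centers):
    # assignment step: each point goes to its nearest center
    labels = [min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
              for p in points]
    # update step: each center moves to the mean of its assigned points
    updated = []
    for c in range(len(centers)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        updated.append(sum(members) / len(members) if members else centers[c])
    return labels, updated

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers = [0.0, 10.0]
for _ in range(10):  # iterate to convergence on this toy input
    labels, centers = lloyd_step(points, centers)
```

On this input both algorithms would settle on the same two clusters; it is only on near-degenerate data (tied distances, empty clusters) that the different computation order can surface as a different final PRG.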
In the amira integration test, we try to build 3 PRGs: `alsB`, `glpG` and `group_18516`. `glpG` is built exactly as before, but this is not true for the other two. `group_18516` has small differences in 2 sites, which amount to just the order of the alleles and the alleles being more nested. One example follows; the other is similar:
```
<before
7 9 CCCCCGC 10 TCCCCGC 9 8 CCCCCGA 8 TCCCCGA 7 GTGGCGCAGGC
>after
7 CCCCCGC 8 CCCCCGA 8 TCCCCGA 8 TCCCCGC 7 GTGGCGCAGGC
```
`alsB` is a bit more complicated. The first difference is that the site has one more allele:
```
>alsB
<before
17 C 18 A 17 ACTGGGCGTCAGCGTTGATATTTTTGCCTC
>after
17 CA 18 CG 18 AA 17 CTGGGCGTCAGCGTTGATATTTTTGCCTC
```
If we compare before and after, both sites can spell CACTGGG... and AACTGGG..., but only the second one can spell CGCTGGG..., and I can find CGCTGGG... in the MSAs, so it seems the new version is actually doing the right thing.
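That claim can be checked mechanically for a single flat site. A hypothetical sketch (not make_prg code), assuming the convention from the examples (odd marker 17 opens/closes the site, even marker 18 separates alleles), with the invariant flanking sequence truncated for brevity:

```python
def spelled(site, suffix):
    """Set of sequences a single flat PRG site can spell, given the
    invariant sequence that follows it."""
    inner = site.split()[1:-1]                       # drop the flanking "17"s
    alleles = [t for t in inner if not t.isdigit()]  # drop the "18" separators
    return {a + suffix for a in alleles}

before = spelled("17 C 18 A 17", "ACTGGG")
after = spelled("17 CA 18 CG 18 AA 17", "CTGGG")
```

`before` spells only {CACTGGG, AACTGGG}, while `after` additionally spells CGCTGGG, matching what is seen in the MSA.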
The second difference is that the second version has two more SNPs in a region than the first one, representing 4 alleles instead of 1:
```
<before
ATGTTGTCAACCAGCGCTTTTGCTGCCGCCGAATATGC
>after
ATGTTGTCAACCA 27 G 28 A 27 CGCTTTTGCT 29 G 30 A 29 CCGCCGAATATGC
```
The 1st allele, common to both versions, is the most common in the MSA (genotypes G and G in both SNPs, n=774). Two other alleles, only present in the second version, are also in the MSA but with very low frequency (genotypes G and A, n=2; genotypes A and G, n=1). A fourth allele, genotypes A and A, is not present in the MSA at all, but this is expected: the A alleles needed at each SNP site to spell G/A and A/G also make A/A an option, since the two sites are encoded independently.
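The combinatorics above can be made explicit: two independent biallelic SNP sites necessarily spell all 2x2 haplotype combinations. A minimal illustration (not make_prg code):

```python
from itertools import product

snp1 = ["G", "A"]   # alleles at site 27/28
snp2 = ["G", "A"]   # alleles at site 29/30
haplotypes = ["".join(h) for h in product(snp1, snp2)]
# observed in the MSA: GG (n=774), GA (n=2), AG (n=1); AA is absent from
# the MSA but is spellable anyway, because the sites are independent
```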
I can continue this analysis if you want, going through the 3rd, 4th, and so on differences in `alsB`, but the first evidence seems to show the new version is actually producing more correct PRGs than before...
looks good to me, am ok to merge
Main changes:

- `scikit-learn v0.24.2` now requires building from source, and I am not sure why... CI failed with Python 3.8, 3.9, 3.10 and 3.11. It was also failing, for the same reason, to build the docker image with `python:3.10-slim`, but somehow worked for `python:3.8-slim`. I tried several ways to fix this, but I didn't manage. This version of `scikit-learn` is not even that old (April 2021). Anyway, due to these issues, I thought it would be a good time to upgrade to the stable versions `>=1.0.0`, so this is one of the main changes in this PR. We then upgraded numpy to `"^1.24.4"` and scikit-learn to `"^1.3.0"`. The issue then was that unit tests were getting stuck and we could not run them. Fixing this required upgrading `pytest` and running it with `--forked`, but then it was too slow to run the tests. Thus I've added `pytest-xdist = "^3.3.1"` to the dependencies, and we now also run `pytest` with `-n 4` (i.e. 4 workers) so that it is fast again;
- `force_overwrite` tests: 3 tests were redundant;
- Use `elkan` as the `kmeans` algorithm in the updated `scikit-learn` version, as tests were failing with the other algorithm. Weirdly, the `elkan` kmeans changes the PRG/result for a single test case, the amira one. So I am a bit reluctant, and wondering if this should be a prerelease until we test it more on real data...

This release properly handles `N`s in the MSA and in the denovo sequences. It fixes breaking issues reported by @Danderson123 in amira.
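As an aside on the `N` handling: `N` is the IUPAC "any base" code, so one natural way to treat it is to expand it over the four nucleotides. A hypothetical sketch of that idea, not make_prg's actual implementation:

```python
from itertools import product

def expand_ns(seq):
    """Enumerate every concrete sequence obtained by replacing each N
    with A/C/G/T. Hypothetical helper, not make_prg's actual code."""
    slots = [("A", "C", "G", "T") if base == "N" else (base,)
             for base in seq.upper()]
    return sorted("".join(p) for p in product(*slots))
```

For example, `expand_ns("ANT")` enumerates the four sequences AAT, ACT, AGT and ATT, while a sequence with no Ns is returned unchanged.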