Cannot reproduce the 'identification of “apicomplexa” vs. contamination contigs' workflow?

Edouard94 commented 5 months ago

Hello Julie,

I tried to rerun the PCA and HCA analyses with your assembly data, but I ran into some errors and reproduction issues.

First, I encountered the following error with get_kmers_occurences_memory_optimization_freq.py:

python get_kmers_occurences_memory_optimization_freq.py -k 5 -s raw_assembly_sup1kb.fasta -o genomes.sup1kb.kmers5
sequences:  13635
Traceback (most recent call last):
  File "/home/Shared/Illumina/Gregarine_ex_Acheta/V0304/11_fastp_trimmed/spades_output/Boisard_analysis/TestBoisardData/get_kmers_occurences_memory_optimization_freq.py", line 80, in <module>
    all_words= all_words+ kmers_count[ss1].keys()
TypeError: can only concatenate list (not "dict_keys") to list

I thus updated the concerned script line with:

all_words = all_words + list(kmers_count[ss1].keys())

After removing the ambiguous characters and running the R script, it appears that I did not obtain the same clusters as you did. Can you have a look at the R figures and I would appreciate any highlight to use the same workflow to assess my genome contaminations. Otherwise I would use blobtools, but your approach seemed very interesting.

Here is the kmer table imported in R: genomes.sup1kb.kmers5_noN.zip

Here is the final R figure I obtained and the kmer table seperated in 6 groups:

Thank you for your time to look into this!

Best wishes. Edouard

julieboisard commented 5 months ago

Hi Edouard!

Thanks for reaching out. For sure the error you encountered is due to python3 handling dict.keys() differently than python 2.7 (in which this script has been written). But your fix should do the trick!

I ran again the analysis and I believe I found the problem: the kmers table needs to display kmers frequencies rather than their count (as I can see your table is a count table).

That can be done by adding the flag -f here: get_kmers_occurences_memory_optimization_freq -k 5 -f

Then you can get the original plot:

I will edit the README so the -f can not be missed :)

Good luck with your project!

Edouard94 commented 5 months ago

Hello Julie,

Thanks a lot for your response, and for making your work and data very accessible and reproducible, it's really nice.

Yes the frequencies was the problem, I was also able to reproduce the expected clustering:

I also want to make sure to understand how to identify the hierarchical clustering with hclust, is this where the clusters were identified (I circled the nodes in red):

With my own data (trimmed 1kb gregarine assembly obtained from pure gametocysts DNA, from cricket) it looks I don't have as much contamination: pca

Thank you again for your work and your help!

Best wishes, Edouard

julieboisard / Marine_gregarines_genomes

Cannot reproduce the 'identification of “apicomplexa” vs. contamination contigs' workflow? #2