Contamination removal help and too large HTML files

rjsorr commented 2 years ago

Hi, I'm running Recentrifuge with Kraken2, mainly to remove contaminants, and for this I have negative control(s). However, in my results the "Exclusive" and "control" outputs for the real samples are identical, so Recentrifuge is removing everything "shared" between the negative and the real sample, rather than removing some taxa and reducing the signal for others, as I was expecting (and as the paper describes).

I am running it on the samples separately and then on the pooled samples. I attach my code. I am running the latest version of the software.

rcf -n /media/ubuntu/Elements/reference_genomes/Recentrifuge/taxdump -k 0_TRJE-N1_NEG.krk -k "$b".krk -c 1 -o ./FINAL/"$b"_after.html -d -s KRAKEN -x 9606 > ./FINAL/"$b"_after_log.txt

rcf -n /media/ubuntu/Elements/reference_genomes/Recentrifuge/taxdump \
-k /media/ubuntu/Elements/NEWPIPELINE_MetaAIR/RAW_DATA/clean_data_1_17b/RECENTRIFUGE \
-c 7 -o OUTPUT.html -d -s KRAKEN -x 9606 > log.txt &

FYI: I now see that the second command works better on the pooled samples, and the control and exclusive outputs for the real samples are different. The problem, however, is that in some cases the HTML file is so large that it cannot be opened.

regards

EXAMPLE.zip

khyox commented 2 years ago

Hi @rjsorr, and thanks for reporting this. The behavior that you are seeing when you use only one negative control and one regular sample is normal, since the "EXCLUSIVE" and "SHARED" sets become useful and different when you have more than one regular sample. So, the 2nd call to Recentrifuge is the right one, as you want to process all the negative controls and all the regular samples from your study together, so that the robust contamination removal algorithm can work better. With a large number of complex samples, you may hit the problem that you mention: the HTML file is huge and browsers cannot handle it. You still have the "extra" output (Excel, CSV, TSV files) working, but this is not comparable to the interactive pie charts. This problem is relatively recent, arising with larger and deeper metagenomic studies, and it will surely get worse. So, I think it's time to include an option to shrink the HTML output, and even to make that reduced HTML the default behavior and the current comprehensive one an option. Let me take a look at this and I'll come back!
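To illustrate the "all controls and all samples in one call" point, here is a minimal dry-run sketch that only prints the rcf command, with the negative controls listed first. It assumes, as the commands in this thread suggest, that -c N marks the first N samples as controls. The helper name build_rcf_cmd and the convention that control files end in _NEG.krk are assumptions for illustration, not something Recentrifuge requires.

```shell
# Print (do not run) an rcf command with the negative controls listed first.
# Assumptions, not from Recentrifuge itself: control files end in _NEG.krk,
# and every other .krk file in the directory is a regular sample.
build_rcf_cmd() {
    dir=$1
    taxdump=$2
    out=$3
    set --          # reuse the positional parameters as the argument list
    n_ctrl=0
    for f in "$dir"/*_NEG.krk; do
        if [ -e "$f" ]; then
            set -- "$@" -k "$f"
            n_ctrl=$((n_ctrl + 1))
        fi
    done
    for f in "$dir"/*.krk; do
        [ -e "$f" ] || continue
        case $f in
            *_NEG.krk) continue ;;   # controls were already added above
        esac
        set -- "$@" -k "$f"
    done
    # Paths are printed unquoted; quote them by hand if they contain spaces.
    printf '%s\n' "rcf -n $taxdump $* -c $n_ctrl -o $out -s KRAKEN -x 9606"
}
```

Calling build_rcf_cmd with a directory of .krk files, the taxdump path, and an output name prints a single command line that can be reviewed before running it.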

rjsorr commented 2 years ago

Thanks, and yes, I can see the issue with the HTML getting worse. Actually, the solution could be to offer a reduced version of all pooled samples but a full version of each separate sample with the pool (dataset). Myself, I would prefer to have an HTML file per sample with results both pre- and post-contamination removal; navigating the dropdown menu with many samples is actually quite cumbersome. Likewise, the same could be done for the "extra" file, so as to provide what is relevant to the sample and what is relevant to the pool. Unfortunately, that file is also getting difficult to open. For myself, a short log file would also be a great addition, simply listing the different contaminants, the contamination levels, and whether they should be removed or not; at present the log gives a lot of text to sort through to get to this answer.

khyox commented 2 years ago

Thanks, @rjsorr, for the feedback. After thinking about different options, I found that one of the existing options of Recentrifuge's main script was the best choice. It's the --summary or -u flag in rcf, which controls the behavior of the summarization in Recentrifuge. With --summary ONLY or just -u ONLY, the code outputs only the input samples and the summarized samples, which greatly reduces the size of both the HTML and the "extra" file by skipping all of the samples generated before the summarization step. I have also slightly changed the format of the option in the commit that closes this issue. About related topics:

I would prefer to have a html file per sample with result both pre and post contamination removal, navigating the dropdown menu with many samples is actually quite cumbersome.

With the option above, that problem should be less severe now. Anyway, one of the advantages of the Krona-style hierarchical pie is that you keep the context (root of the pie, selected taxon, search field, size option, etc.) when moving from one sample to another, and that is very useful when interactively exploring the dataset. If you were to generate just one HTML file per sample, you would lose all those features.

Likewise, the same could be done for the "extra" file as to provide what is relevant to the sample and what is relevant to the pool. It is also getting difficult to open this as well unfortunately.

Again, the -u ONLY option should help with this. In this case, as this output is not intended for interactive use, you already have the option to generate one file per sample: use the --extra MULTICSV or just -e MULTICSV option and you will get that.
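As a rough sketch, the two options can be combined in a single call. The helper below only prints the command (a dry run); its name rcf_reduced_cmd and the argument order are illustrative assumptions, and the remaining flags are copied from the commands earlier in this thread.

```shell
# Dry-run helper: print an rcf call that uses the reduced outputs.
# Arguments (all placeholders): taxdump dir, Kraken input, number of
# controls, output HTML name.
rcf_reduced_cmd() {
    printf '%s\n' "rcf -n $1 -k $2 -c $3 -o $4 --summary ONLY --extra MULTICSV -s KRAKEN -x 9606"
}
```

The value is written in upper case (ONLY, MULTICSV), as in the option names given above.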

For myself, a short log file would also be a great addition, simply listing the different contaminants, contamination levels, and if they should be removed or not, at present the log gives a lot of text to sort through to get to this answer.

Yes, Recentrifuge's output is quite verbose, especially if you activate the --debug flag. You can always parse it with your preferred command-line tool. In addition, the color codes should help to identify some messages. For example, the different types of contamination have different labels and colors, as the manual describes. Anyway, I agree that a nice addition would be to generate a file devoted to the output of the robust contamination removal algorithm, so that you can have that information in a more compact format. If you would like to contribute that feature to the code, I would be happy to review your PR. The PR should also add a test for this feature via retest, since it would be an important addition that we would like to check with every commit. We can discuss this further if you are interested and have the time.
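As an example of that kind of parsing, here is a minimal sketch that strips the ANSI color codes from a saved log and keeps only the lines that mention contamination. The function name filter_contam and the search pattern "contam" are assumptions; the actual labels are the ones listed in the manual, so the pattern may need adjusting.

```shell
# Strip ANSI color sequences from a saved rcf log, then keep the lines
# that mention contamination.
# Assumption: the relevant lines contain "contam" (case-insensitive).
filter_contam() {
    esc=$(printf '\033')                      # literal ESC character
    sed "s/${esc}\[[0-9;]*m//g" "$1" | grep -i 'contam'
}
```

For instance, filter_contam log.txt > contaminants.txt would leave a much shorter file to read through.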

rjsorr commented 2 years ago

Cheers @khyox, gave it a go! It worked for one of my datasets. The other, unfortunately, is still giving a 360 MB HTML file :)

khyox commented 2 years ago

Thanks @rjsorr. Did you get 360 MB even with the --summary ONLY option? How many samples does that dataset have? I guess I will have to consider splitting the HTML output once a certain number of samples is reached, or once another related metric is exceeded that also takes into account the complexity of the samples, since the number of taxa in each sample also increases the weight of the HTML file.
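For reference, both figures can be collected with a few lines of shell. This is only a sketch: the function name report_dataset is made up, and it assumes one .krk file per sample in the input directory.

```shell
# Report the number of Kraken samples in a directory and the size of the
# generated HTML file in whole megabytes.
# Assumption: each sample is a single .krk file in the given directory.
report_dataset() {
    n=0
    for f in "$1"/*.krk; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    bytes=$(wc -c < "$2")
    printf '%s samples, %s MB\n' "$n" "$((bytes / 1000000))"
}
```

It takes the input directory and the HTML file as its two arguments.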

rjsorr commented 2 years ago

271 samples with 7 of these being negatives :)

rcf -n /media/ubuntu/Elements/reference_genomes/Recentrifuge/taxdump \
-k /media/ubuntu/Elements/NEWPIPELINE_MetaAIR/RAW_DATA/RECENTRIFUGE1819/18 \
-c 7 -o OUTPUT.html -u only -e MULTICSV -s KRAKEN -x 9606 > log.txt &

khyox commented 2 years ago

OK, thanks, @rjsorr. Datasets are now very easily reaching hundreds of samples.

Are you using the latest release of Recentrifuge? The option -u only should throw an error now, with -u ONLY as the current choice.

Anyway, I am reopening the issue to pursue another solution for larger datasets, very likely splitting the html file.

rjsorr commented 2 years ago

@khyox, running 1.7, but see 1.8 is on conda now :). cheers

rjsorr commented 2 years ago

btw @khyox, is it possible to get abundance tables (post contaminant removal) as output so I can run these through another program? What is the correct command for this?

khyox commented 2 years ago

Hi @rjsorr, sorry, I don't understand it well... do you mean "abundance tables" after the contamination removal? Any column in the output where you see _CTRL_ in the name is already showing a dataset that is free of contaminants as per the algorithm using the negative controls in your samples.
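If the goal is to pass only those columns to another program, a small awk filter can do it. This is a sketch under assumptions about the layout of the "extra" file: tab-separated, column names in the first row, and the taxon identifier in the first column. The function name ctrl_columns is made up for the example.

```shell
# Keep the first column plus every column whose header contains "_CTRL_".
# Assumptions: tab-separated input, column names in the first row,
# taxon identifier in the first column.
ctrl_columns() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 {
            # remember the positions of the _CTRL_ columns
            for (i = 2; i <= NF; i++) if ($i ~ /_CTRL_/) keep[++k] = i
        }
        {
            line = $1
            for (j = 1; j <= k; j++) line = line OFS $(keep[j])
            print line
        }
    ' "$1"
}
```

Redirecting the output of ctrl_columns to a new file gives a table holding only the contaminant-free columns.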

khyox commented 10 months ago

As it's relevant to this issue, version 1.13.0 of Recentrifuge introduces a substantial reduction in the HTML size.

khyox / recentrifuge

Contamination removal help and too large HTML files #38