Open rjsorr opened 2 years ago
Hi @rjsorr, and thanks for reporting this. The behavior that you are seeing when you use only one negative control and one regular sample is normal, since the "EXCLUSIVE" and "SHARED" sets are useful and become different when you have more than one regular sample. So, the 2nd call to Recentrifuge is the right one, as you want to process all the negative controls and all the regular samples from your study, so that the robust contamination removal algorithm can work better. For a large number of complex samples, you may hit the problem that you mention: the HTML file is huge and becomes intractable for browsers. You still have the "extra" output (Excel, CSV, TSV files) working, but this is not comparable to the interactive pie charts. This problem is relatively recent, arising with larger and deeper metagenomic studies, and it will surely get worse. So, I think it's time to include an option to shrink the HTML output, and even to make that reduced HTML the default behavior, with the current comprehensive one as an option. Let me take a look at this and I'll come back!
Thanks, and yes, I see the issue with the HTML getting worse. Actually, the solution could be to offer a reduced version of all pooled samples but a full version of each separate sample with the pool (dataset). Myself, I would prefer to have an HTML file per sample with results both pre and post contamination removal; navigating the dropdown menu with many samples is actually quite cumbersome. Likewise, the same could be done for the "extra" file, so as to provide what is relevant to the sample and what is relevant to the pool. It is also getting difficult to open, unfortunately. For myself, a short log file would also be a great addition, simply listing the different contaminants, contamination levels, and whether they should be removed or not; at present the log gives a lot of text to sort through to get to this answer.
Thanks, @rjsorr, for the feedback. After thinking about different options, I found that one of the existing options of Recentrifuge's main script was the best choice. It's the --summary or -u flag in rcf, which controls the behavior of the summarization in Recentrifuge. By using --summary ONLY or just -u ONLY, the code outputs only the input samples and the summarized samples, which greatly reduces the size of both the HTML and the "extra" file by skipping all of the samples generated before the summarization step. Anyway, I have changed the format of the option a bit in the commit that closes this issue. About related topics:
I would prefer to have an HTML file per sample with results both pre and post contamination removal; navigating the dropdown menu with many samples is actually quite cumbersome.
With the option above, that problem should be better now. Anyway, one of the advantages of the Krona-style hierarchical pie is that you keep the context (root of the pie, selected taxon, search field, size option, etc.) when moving from one sample to another, and that is very useful when interactively exploring the dataset. If you were to generate just one HTML file per sample, you would lose all those features.
Likewise, the same could be done for the "extra" file, so as to provide what is relevant to the sample and what is relevant to the pool. It is also getting difficult to open, unfortunately.
Again, the -u ONLY option should help with this. In this case, as this output is not intended for interactive use, you already have the option to generate one file per sample: use the --extra MULTICSV or just -e MULTICSV option and you will get that.
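For illustration, the two options can be combined in a single call. This is only a sketch assembled from the flags that appear in this thread; the paths, the variable names and the helper build_rcf_cmd are placeholders invented for the example, not part of Recentrifuge. The function just prints the command so that it can be reviewed before running it.

```shell
# Placeholder values (assumptions for this sketch); adjust to your setup.
TAXDUMP="./taxdump"         # hypothetical path to the taxdump directory
KRAKEN_DIR="./kraken_out"   # hypothetical directory with the Kraken outputs
N_CONTROLS=7                # number of negative control samples

# Hypothetical helper: prints an rcf call that uses the summary-only mode
# (-u ONLY) and one "extra" file per sample (-e MULTICSV). The remaining
# flags are copied from the commands posted in this thread.
build_rcf_cmd() {
    printf 'rcf -n %s -k %s -c %s -o OUTPUT.html -u ONLY -e MULTICSV -s KRAKEN -x 9606\n' \
        "$TAXDUMP" "$KRAKEN_DIR" "$N_CONTROLS"
}

# Show the command instead of running it.
build_rcf_cmd
```

Once the printed line looks right, it can be pasted into the terminal, optionally redirecting the output to a log file as in the commands above.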
For myself, a short log file would also be a great addition, simply listing the different contaminants, contamination levels, and whether they should be removed or not; at present the log gives a lot of text to sort through to get to this answer.
Yes, Recentrifuge's output is quite verbose, especially if you activate the --debug flag. You can always parse it with your preferred command-line tool. In addition, the color codes should help to identify some messages. For example, the different types of contamination have different labels and colors, as the manual describes. Anyway, I agree that a nice addition would be to generate a file devoted to the output of the robust contamination removal algorithm, so that you can have that information in a more compact format. If you would like to contribute that feature to the code, I would be happy to review your PR. The PR should also add a test for this feature via retest, since that feature would be an important addition that we would like to check with every commit. We can discuss this further if you are interested and have the time.
Cheers @khyox, gave it a go! It worked for one of my datasets. The other, unfortunately, is still giving a 360 MB HTML file :)
Thanks @rjsorr. Did you get 360 MB even with the --summary ONLY option? How many samples does that dataset have? I guess I will have to consider splitting the HTML output when a certain number of samples is reached, or when another related metric is exceeded that also takes into account the complexity of a sample, since the number of taxa in each sample also increases the weight of the HTML file.
271 samples with 7 of these being negatives :)
rcf -n /media/ubuntu/Elements/reference_genomes/Recentrifuge/taxdump \
-k /media/ubuntu/Elements/NEWPIPELINE_MetaAIR/RAW_DATA/RECENTRIFUGE1819/18 \
-c 7 -o OUTPUT.html -u only -e MULTICSV -s KRAKEN -x 9606 > log.txt &
OK, thanks, @rjsorr. Datasets are now very easily reaching hundreds of samples.
Are you using the latest release of Recentrifuge? The option -u only should throw an error now, with -u ONLY as the current choice.
Anyway, I am reopening the issue to pursue another solution for larger datasets, very likely splitting the html file.
@khyox, running 1.7, but see 1.8 is on conda now :). cheers
btw @khyox, is it possible to get abundance tables (post contaminant removal) as output so I can run these through another program? What is the correct command for this?
Hi @rjsorr, sorry, I don't quite understand... do you mean "abundance tables" after the contamination removal? Any column in the output with _CTRL_ in its name already shows a dataset that is free of contaminants, as determined by the algorithm using the negative controls in your samples.
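If the goal is to pass only those decontaminated columns to another program, something like the following could pull them out of a table. It is a rough sketch, assuming a tab-separated file with one header row and an identifier in the first column; that layout and the helper select_ctrl_columns are assumptions for illustration, not Recentrifuge's documented output format.

```shell
# Hypothetical helper: keep the first column plus every column whose header
# contains "_CTRL_". Assumes a tab-separated table with a single header row.
select_ctrl_columns() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            n = 0
            keep[++n] = 1                  # always keep the first column
            for (i = 2; i <= NF; i++)
                if ($i ~ /_CTRL_/) keep[++n] = i
        }
        {
            # Rebuild each row (header included) from the kept columns only.
            line = $(keep[1])
            for (j = 2; j <= n; j++) line = line OFS $(keep[j])
            print line
        }
    ' "$1"
}
```

For example, select_ctrl_columns samples.tsv > ctrl_only.tsv would write the reduced table to a new file, where samples.tsv stands for whichever tab-separated export you are working with.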
As it's relevant to this issue, version 1.13.0 of Recentrifuge introduces a substantial reduction in the HTML size.
Hi, I'm running recentrifuge with kraken2 with the main goal being removal of contaminants and for this I have negative control(s). However, what I'm seeing in my results is that the "Exclusive" and "control" for the real samples are identical, so recentrifuge is removing everything "shared" between the negative and real sample, rather than removing some and reducing the signal for others, as I was expecting (and also what I see from the paper)?
I am running on samples separately and then on the pooled samples. I attach my code. I am running the latest version of the software.
rcf -n /media/ubuntu/Elements/reference_genomes/Recentrifuge/taxdump -k 0_TRJE-N1_NEG.krk -k "$b".krk -c 1 -o ./FINAL/"$b"_after.html -d -s KRAKEN -x 9606 > ./FINAL/"$b"_after_log.txt
rcf -n /media/ubuntu/Elements/reference_genomes/Recentrifuge/taxdump \
-k /media/ubuntu/Elements/NEWPIPELINE_MetaAIR/RAW_DATA/clean_data_1_17b/RECENTRIFUGE \
-c 7 -o OUTPUT.html -d -s KRAKEN -x 9606 > log.txt &
FYI: I now see that the second call works better on pooled samples, and the control and exclusive sets for the real samples are different. The problem, however, is that in some cases the HTML file is so large that it is not possible to open.
regards
EXAMPLE.zip