shandley / hecatomb

hecatomb is a virome analysis pipeline for analysis of Illumina sequence data
MIT License

Filtering out host (human) genome beforehand? #96

Open mhmism opened 11 months ago

mhmism commented 11 months ago

Thank you for this wonderful tool!

Should we filter out the host (human) genome before executing the pipeline, e.g. using KneadData or fastp? Does the same apply to the PhiX genome?

I tried Hecatomb on one of my DNA shotgun metagenomics datasets and found a large difference in the output with versus without prior host DNA removal. Specifically, the number (and diversity) of viral sequences retrieved was much higher when I did not remove the host (human) DNA before running Hecatomb. The other issue is that a large proportion of sequences was assigned to RNA viruses, including ones that I should not normally see in my dataset, such as Human immunodeficiency virus. These RNA viruses were found both with and without prior host DNA removal; however, their proportion was significantly higher when the host DNA had not been removed. This makes me think that host DNA is being misclassified as viral in my dataset. Also, I am not sure whether I should expect to find any RNA viruses at all, given that my dataset is DNA shotgun metagenomics.

For more context, my dataset is a bulk shotgun metagenomics dataset (i.e. not virus-enriched).

Thank you in advance!

beardymcjohnface commented 11 months ago

Hi, you shouldn't need to perform host removal, as this step is performed by Hecatomb, but there will be a difference due to the way Hecatomb prepares the references for filtering. Viral-like sequences in the host are masked to avoid removing real viral sequences that happen to be similar, but this means some host sequences get through and need to be filtered later. Hecatomb currently doesn't remove PhiX, but I think this will change in the next version.

I'm interested in hearing what your preference would be regarding filtering, as we've had this conversation several times about which approach would be best. I wouldn't expect to find many RNA viruses in a DNA metagenome, but you might still have hits to known RNA viruses if they share homology with DNA viruses in your sample.
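To make the masking idea concrete, here is a minimal sketch of what masking viral-like regions in a host reference means. It is not Hecatomb's actual code: the function name `mask_regions`, the toy sequence, and the interval coordinates are all made up for illustration, and in practice the intervals would come from aligning viral sequences against the host genome.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Replace each half-open [start, end) interval of `sequence` with mask_char.

    Hypothetical helper for illustration only; not part of Hecatomb.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Overwrite the interval so reads matching it no longer align to the host.
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Toy host sequence and made-up viral-like intervals (assumptions).
host = "ACGTACGTACGTACGT"
viral_like = [(4, 8), (12, 14)]
print(mask_regions(host, viral_like))  # ACGTNNNNACGTNNGT
```

Reads that match only the masked stretches will not align to the host reference, so they are kept by the host filter. That protects genuine viral reads, but it is also why some host-derived reads reach the later classification steps.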

mhmism commented 11 months ago

Thanks for your response. It would be great to remove the PhiX genome in the next version of Hecatomb; I look forward to it. Regarding the filtering process, unfortunately, there is no easy answer. Based on what I saw in my toy dataset, I think many of the host DNA reads were wrongly classified as RNA viruses (this is suggested by the large proportion of RNA viruses retrieved from a DNA dataset, which is unexpected behaviour). This may be a problem with short-read datasets in general. On the other hand, you may also lose some DNA viruses if you filter beforehand. If you would like to be more conservative and avoid false positives as much as possible, then removing host DNA beforehand might be needed. However, reaching a firmer conclusion would require benchmarking on synthetic datasets that contain a known mix of microbial (including viral) and host short reads.
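As a rough sketch of the kind of benchmark meant here: if every read in a synthetic dataset has a known origin, the fraction of host reads that end up called viral can be computed directly. The function name, the read IDs, and the labels (`"host"`, `"viral"`, and so on) are hypothetical and do not correspond to any Hecatomb output format.

```python
def host_false_positive_rate(truth, calls):
    """Fraction of host-derived reads that were called viral.

    truth: read ID -> true origin label (assumed labels, e.g. "host", "viral").
    calls: read ID -> label assigned by the pipeline; missing IDs count as not viral.
    """
    host_reads = [read_id for read_id, origin in truth.items() if origin == "host"]
    if not host_reads:
        return 0.0
    wrong = sum(1 for read_id in host_reads if calls.get(read_id) == "viral")
    return wrong / len(host_reads)


# Made-up toy data: four host reads, one of which is miscalled as viral.
truth = {"r1": "host", "r2": "host", "r3": "viral",
         "r4": "bacterial", "r5": "host", "r6": "host"}
calls = {"r1": "viral", "r3": "viral", "r4": "unclassified"}
print(host_false_positive_rate(truth, calls))  # 0.25
```

Running the same calculation with and without prior host removal, and alongside the fraction of true viral reads recovered, would show the trade-off between false positives and lost DNA viruses.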

In addition, you may wish to include an option to search only the DNA viral catalogue, only the RNA viral catalogue, or both. This way, the search can better suit the type of dataset being investigated.

I am curious to know your thoughts!

beardymcjohnface commented 11 months ago

Yes, I agree 100%. This misclassification of host DNA as RNA viruses is very typical. I like the idea of switching off searching for RNA viruses; I'll have to think of the best way to implement it as we want to do the same thing for phages.
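A post-hoc approximation of the DNA-only idea can be sketched as a filter over a table of hits. This is only an illustration under stated assumptions: the column names (`read_id`, `family`, `genome_type`), the group labels, and the function `keep_hits` are invented for the example and are not Hecatomb's output schema, and filtering results afterwards is not the same as restricting the search database itself.

```python
import csv
import io

# Assumed labels for DNA genome types in the hypothetical `genome_type` column.
DNA_GROUPS = {"dsDNA", "ssDNA"}


def keep_hits(rows, allowed, column="genome_type"):
    """Keep only rows whose genome type is in `allowed`."""
    return [row for row in rows if row.get(column) in allowed]


# Made-up example table; a real run would read a TSV file from disk instead.
example_tsv = (
    "read_id\tfamily\tgenome_type\n"
    "r1\tMicroviridae\tssDNA\n"
    "r2\tRetroviridae\tssRNA-RT\n"
    "r3\tHerpesviridae\tdsDNA\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
dna_only = keep_hits(rows, DNA_GROUPS)
print([row["family"] for row in dna_only])  # ['Microviridae', 'Herpesviridae']
```

The same function with a phage-specific column and label set would cover the analogous phage case mentioned above.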