ucsd-cmi / qebil

BSD 3-Clause "New" or "Revised" License

Blast/sortmerna head of fastq file when amplicon target gene is ambiguous? #3

Open adswafford opened 3 years ago

adswafford commented 3 years ago

@antgonza when downloading wastewater studies with amplicon data, almost all of them are coming back ambiguous even when it's clear from the study page that it is 16S, e.g. https://www.ebi.ac.uk/ena/browser/view/PRJDB6476

What do you think about taking the head of the fastq data and testing to see if it is from a target gene we support (16S, ITS, 18S) instead of just calling it ambiguous?
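For illustration, pulling the head of a FASTQ file into FASTA for a quick BLAST/SortMeRNA check could look roughly like the sketch below. The helper name `fastq_head_to_fasta` and the default of 1000 reads are hypothetical, and it assumes standard four-line FASTQ records; the actual search against 16S/ITS/18S references is left out.

```python
import gzip
from itertools import islice


def fastq_head_to_fasta(fastq_path, n_reads=1000):
    """Return the first n_reads records of a FASTQ file as FASTA text.

    Hypothetical helper: assumes four lines per record (no wrapped sequences).
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lines = []
    with opener(fastq_path, "rt") as handle:
        # Each FASTQ record is four lines: header, sequence, '+', qualities.
        for i, line in enumerate(islice(handle, 4 * n_reads)):
            if i % 4 == 0:
                # Swap the leading '@' for the FASTA '>' marker.
                lines.append(">" + line.rstrip("\n")[1:])
            elif i % 4 == 1:
                lines.append(line.rstrip("\n"))
    return "\n".join(lines) + ("\n" if lines else "")
```

The resulting FASTA text could then be written to a temporary file and passed to whichever aligner we settle on.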

antgonza commented 3 years ago

Yeah, that study has no preparation/experiment metadata to accurately select the target gene. However, as you mentioned, it's in the study description. What about parsing that and assigning the target gene when only one is mentioned, like 16S in the example?

adswafford commented 3 years ago

I suppose it's a matter of how much we trust the data in the files to match what's in the abstract/description? I think it's worth a shot along with some warning text that we can add to the analytical notes section. What's the best way to convey that info? We can't put it in the qebil_status file since then it won't just have the line 'complete'.

So just to confirm, the short-term plan is to search the abstract and description for any of our known target genes. If only one appears and the library_strategy is amplicon, we'll assign the target gene inferred from the abstract and leave it to the user to sort it out after processing if it's incorrect.
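A minimal sketch of that plan, assuming the abstract, description, and library_strategy are already available as strings. The function name `infer_target_gene`, the regex patterns, and the wording of the note are illustrative assumptions, not existing qebil code.

```python
import re

# Hypothetical patterns for the supported target genes; a real list would need tuning.
# ITS is matched case-sensitively so the ordinary word "its" is not a hit.
TARGET_GENE_PATTERNS = {
    "16S": re.compile(r"\b16S\b", re.IGNORECASE),
    "18S": re.compile(r"\b18S\b", re.IGNORECASE),
    "ITS": re.compile(r"\bITS[12]?\b"),
}


def infer_target_gene(abstract, description, library_strategy):
    """Return (gene, note); gene is None when nothing can be inferred safely."""
    # Only attempt inference for amplicon libraries.
    if (library_strategy or "").strip().upper() != "AMPLICON":
        return None, ""
    text = f"{abstract or ''} {description or ''}"
    found = sorted(g for g, p in TARGET_GENE_PATTERNS.items() if p.search(text))
    # Zero or multiple genes mentioned: stay ambiguous.
    if len(found) != 1:
        return None, ""
    note = (
        f"Target gene {found[0]} inferred from study abstract/description; "
        "verify after processing."
    )
    return found[0], note
```

The returned note is meant as the warning text for the analytical notes, so qebil_status can keep its single 'complete' line.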