naobservatory / mgs-pipeline

MIT License

Adding feature to ribocounts that saves non-rRNA read IDs #24

Closed: lennijusten closed 9 months ago

lennijusten commented 9 months ago

RiboDetector takes a long time to run, and we currently save only the number of rRNA reads. To learn the rRNA status of an individual read, we would have to run it through RiboDetector again. I expect this information to be useful in the future, and saving the read IDs costs very little.

Here, I add a feature that saves, for each sample, a text file of non-rRNA read IDs, with the IDs parsed by FastqGeneralIterator. The text file is saved to AWS, in a directory called ribopass-reads/ within each bioproject.
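
For illustration only, here is a minimal sketch of the ID-saving step described above. It is not the code from this change. The function name `write_read_ids` is hypothetical, and treating the first whitespace-delimited token of the FASTQ title as the read ID is an assumption. If Biopython is not installed, the sketch falls back to a small stand-in parser that only handles strict four-line FASTQ records.

```python
import io

try:
    # Biopython's fast FASTQ parser; yields (title, sequence, quality) tuples.
    from Bio.SeqIO.QualityIO import FastqGeneralIterator
except ImportError:
    def FastqGeneralIterator(handle):
        """Stand-in for strict four-line FASTQ records (assumption: no line wrapping)."""
        while True:
            header = handle.readline()
            if not header:
                return  # end of file
            if not header.strip():
                continue  # skip stray blank lines
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:].rstrip("\n"), seq, qual


def write_read_ids(fastq_handle, out_handle):
    """Write one read ID per line to out_handle; return the number of reads seen.

    Hypothetical helper: the ID is assumed to be the first
    whitespace-delimited token of the FASTQ title line.
    """
    n_reads = 0
    for title, _seq, _qual in FastqGeneralIterator(fastq_handle):
        out_handle.write(title.split()[0] + "\n")
        n_reads += 1
    return n_reads
```

In practice the input would be RiboDetector's non-rRNA FASTQ output for a sample, opened in text mode, and the output a plain text file that is then copied to ribopass-reads/.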

Question: Is it worth compressing the text files before copying them to AWS?
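
If compression turns out to be worthwhile, one way to do it is sketched below with Python's standard `gzip` module. The helper name `gzip_file` is hypothetical and the pipeline may prefer a different tool or format; this only shows how a read ID file could be compressed before being copied.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src):
    """Compress src to '<src>.gz' in the same directory and return the new path.

    Hypothetical helper: the original file is left in place.
    """
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes through gzip so large ID files are not loaded into memory.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Read IDs within a sample usually share long common prefixes, so they tend to compress well, but the actual saving would need to be measured on real files.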

lennijusten commented 9 months ago

@jeffkaufman I switched the RiboDetector feature from ribocounts() to riboreads(). The new function saves the read titles of rRNA reads to AWS, in a directory called riboreads/. I'm starting to re-run the bioprojects now.

All AWS directories titled ribocounts/ or ribopass-reads/ can be deleted.

I also updated the prepare_dashboard files to match the new paradigm, which means the dashboard/ribocounts/ dir can be deleted. Running prepare_dashboard.sh will pull files into a new dir called dashboard/riboreads/.