Identify unmapped reads in BS DNA data

sr320 commented 2 years ago

Discussed in https://github.com/sr320/ceabigr/discussions/11

^{Originally posted by **sr320** February 11, 2022} There are a good deal of sequence reads that do not map to the genome. For various reasons it would be nice to know what those are. One might be if we see presence of consistent secondary taxa, that could be integrated into our analysis and serve as a proxy for data such as microbiome or symbionts. We have the `fq` files of unmapped reads. What are some robust approaches to identify these reads?

@kubu4 can you given MEGAN a whirl on unmapped BS reads?

Currently they are located at /gscratch/scrubbed/sr320/021022-BS-unmap/*unmapped_reads*.fq.gz

kubu4 commented 2 years ago

Just a heads up, since these are already bisulfite converted reads, the taxonomic assignments are probably going to be somewhat inaccurate.

kubu4 commented 2 years ago

I was starting to work on this and noticed that some of the files had time stamps listed as this morning. Are these still part of an active Mox job (i.e. should I wait to start this)?

sr320 commented 2 years ago

It can wait

sr320 commented 2 years ago

It can wait

kubu4 commented 2 years ago

Looks like bulk of reads (i.e. > 75%) map to Crassostrea (see images at bottom of post).

So, not sure why they're not getting mapped...

I'm assuming these came from Bismark? If yes, the manual provides a bit of insight into capturing unmapped reads. Additionally, a comparison of unmapped reads and ambiguous reads might allow a better grasp of what's happening:

--un

Write all reads that could not be aligned to the file _unmapped_reads.fq.gz in the output directory. Written reads will appear as they did in the input, without any translation of quality values that may have taken place within Bowtie or Bismark. Paired-end reads will be written to two parallel files with _1 and _2 inserted in their filenames, i.e. unmapped_reads_1.fq.gz and unmapped_reads_2.fq.gz. Reads with more than one valid alignment with the same number of lowest mismatches (ambiguous mapping) are also written to unmapped_reads.fq.gz unless --ambiguous is also specified.

--ambiguous

Write all reads which produce more than one valid alignment with the same number of lowest mismatches or other reads that fail to align uniquely to _ambiguous_reads.fq. Written reads will appear as they did in the input, without any of the translation of quality values that may have taken place within Bowtie or Bismark. Paired-end reads will be written to two parallel files with _1 and _2 inserted in their filenames, i.e. _ambiguous_reads_1.fq and _ambiguous_reads_2.fq. These reads are not written to the file specified with --un.

Wordcloud of taxonomic assignments of reads for each sample (Genus level):

Wordcloud of taxonomic assignments for all unmapped BSseq reads in each sample at the "Genus" level generated using MEGAN6 software, with the Genus "Crassostrea" as the most prominent of all the words present. In fact, all other words so small as to be illegible.

Phylogenetic tree of taxonomic assignments of reads for each sample (Genus level):

Phylogenetic tree of unmapped BSseq reads generated by MEGN6 software showing that most reads in all samples examined are assigned to the Genus "Crassostrea"

And, here's the tree at the species level to help confirm that there's not a bunch of C.gigas contamination:

Phylogenetic tree of unmapped BSseq reads generated by MEGN6 software showing that most reads in all samples examined are assigned to the Species"Crassostrea virginica"

sr320 / ceabigr

Identify unmapped reads in BS DNA data #22

Discussed in https://github.com/sr320/ceabigr/discussions/11