greenelab / 2022-microberna

A pipeline to generate a compendia of bacterial and archaeal RNA-seq data
BSD 3-Clause "New" or "Revised" License
4 stars 1 forks source link

recording puffaligner stdout #4

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago
ln -s ../../../inputs/gtdb_genomes/s__Bacillus_pumilus/*fna.gz bp_genomes/
cat *fna.gz > all_bp_genomes.fna.gz
cd ..
~/github/pufferfish/build/src/pufferfish index -r bp_genomes/all_bp_genomes.fna.gz -o bp_index -p 9 -k 31
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR750/006/SRR7508446/SRR7508446_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR750/006/SRR7508446/SRR7508446_2.fastq.gz
~/github/pufferfish/build/src/pufferfish align -1 SRR7508446_1.fastq.gz -2 SRR7508446_2.fastq.gz -i bp_index/ -o SRR7508446.sam
saw 10600106 reads : pe / read = 63.2961 : se / read = 0 [2022-01-21 13:32:58.937] [console] [info] flushing output queue.
[2022-01-21 13:32:58.937] [console] [info] Done mapping reads.
[2022-01-21 13:32:58.937] [console] [info]

[2022-01-21 13:32:58.937] [console] [info] =====
[2022-01-21 13:32:58.937] [console] [info] Observed 10680167 reads
[2022-01-21 13:32:58.937] [console] [info] Rate of Fragments with at least one found k-mer: 94.04%
[2022-01-21 13:32:58.937] [console] [info] Discordant Rate: 19.27%
[2022-01-21 13:32:58.937] [console] [info] Total reads Mapped: 9242836
[2022-01-21 13:32:58.937] [console] [info] Mapping rate : 86.54%
[2022-01-21 13:32:58.937] [console] [info] Average # hits per read : 58.4831
[2022-01-21 13:32:58.937] [console] [info] Total # of alignments : 624609180
[2022-01-21 13:32:58.937] [console] [info] Total # of orphans : 2057924
[2022-01-21 13:32:58.937] [console] [info] Total # of pe hits : 676007867
[2022-01-21 13:32:58.937] [console] [info] Total # of total Hits : 7184912
[2022-01-21 13:32:58.937] [console] [info] Max multimapping group : 200
[2022-01-21 13:32:58.937] [console] [info] Total number of alignment attempts : 1216425000
[2022-01-21 13:32:58.937] [console] [info] Number of skipped alignments because of cache hits : 917057701
[2022-01-21 13:32:58.937] [console] [info] Number of skipped alignments because of perfect chains : 91182414
[2022-01-21 13:32:58.937] [console] [info] Number of cigar strings which are fixed: 0
[2022-01-21 13:32:58.937] [console] [info] =====
Elapsed time: 4726.16s
-rw-r--r-- 1 tereiter tereiter 289G Jan 21 13:32 SRR7508446.sam
taylorreiter commented 2 years ago

So I don't think that pufferfish is going to work in this case. Pufferfish does the mapping against all reference genomes remarkably well, but by default it outputs the top 200 alignments for each read, so for portions of the genome that are repetitive between strains, we'll get many mappings reported. Salmon is then used downstream to quantify the best alignment with it's EM algorithm, but salmon is designed to quantify reads mapped against transcriptomes, and in this case we would have mapped against the genome. so we sort of end up back at square one, needing to use salmon without being able to detect spanning reads.

I dug deeper into how unmapped reads are reported...one thing we could do is take advantage of the information that is output with the unmapped reads. For single end reads (which I plan to quasi-map PE reads as SE bc operons), the default is that all reads that don't map against the transcriptome are labelled as U for unmapped. But, if we provide a decoy (a set of sequences to which reads that map better than they may map against the transcriptome, thereby discarding the transcriptome mapping; a strategy to avoid poor mappings that belong to other sequences than those represented in the transcriptome), unmapped reads can also be labelled as D for decoy. Soooo i we provide all of the intergenic segments of the genome as decoys, then reads that map better to intergenic regions won't be quantified, and they will be labelled as D. We just need to be careful to construct the decoy to not include e.g. real coding things that don't code for proteins; e.g. trnas, rrnas, etc.

This still would mean that downstream analysis would be required to identify where the intergenic sequences mapped, but it at least gives us a way to start getting at them.