njdbickhart / RumenLongReadASM

Scripts, notes and commands to accompany our Rumen Metagenome assembly manuscript
GNU General Public License v3.0
1 stars 0 forks source link

Error parsing malformed sam at record: #4

Open ecpierce opened 2 years ago

ecpierce commented 2 years ago

Hi @njdbickhart ! Awesome tool. When I run the wrapper script, I get a bunch of "Error parsing malformed sam at record: " errors. Otherwise, everything seems normal and I get the expected outputs. Is this something I should be concerned about?

python3 viralAssociationPipeline.py -a We_unbin_unmap_priority_1000.fasta -g we_mgepredictions.fasta -b We_mega_blo.txt -i We_3_6_hictomega.bam -v we_mgepredictions_length.txt -o we_3_6_rumen -l 4718.We6--We6.fastq Loaded 691 viral contigs for analysis

[M::mm_idx_gen::0.6881.00] collected minimizers [M::mm_idx_gen::0.8461.37] sorted minimizers [M::main::0.8461.37] loaded/built the index for 691 target sequence(s) [M::mm_mapopt_update::0.9181.34] mid_occ = 12 [M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 691 [M::mm_idx_stat::0.9631.32] distinct minimizers: 1867596 (86.32% are singletons); average occurrences: 1.183; average spacing: 7.567 [M::worker_pipeline::15.8201.93] mapped 37588 sequences [M::worker_pipeline::22.1842.32] mapped 37325 sequences [M::worker_pipeline::29.1852.54] mapped 37111 sequences [M::worker_pipeline::35.9332.69] mapped 37036 sequences [M::worker_pipeline::42.6722.78] mapped 36975 sequences [M::worker_pipeline::49.2022.84] mapped 37128 sequences [M::worker_pipeline::55.9012.90] mapped 37163 sequences [M::worker_pipeline::62.3992.94] mapped 37140 sequences [M::worker_pipeline::69.1012.97] mapped 37292 sequences [M::worker_pipeline::75.7413.00] mapped 37171 sequences [M::worker_pipeline::82.7243.02] mapped 37053 sequences [M::worker_pipeline::89.2713.04] mapped 36988 sequences [M::worker_pipeline::96.0683.06] mapped 36983 sequences [M::worker_pipeline::102.8593.07] mapped 36864 sequences [M::worker_pipeline::109.6823.09] mapped 36800 sequences [M::worker_pipeline::116.5693.09] mapped 36838 sequences [M::worker_pipeline::120.3263.09] mapped 20261 sequences [M::main] Version: 2.17-r941 [M::main] CMD: /home/e1pierce/anaconda3/bin/minimap2 -x map-pb we_mgepredictions.fasta 4718.We6--We6.fastq [M::main] Real time: 120.356 sec; CPU: 371.259 sec; Peak RSS: 1.023 GB Identified 34331 overlapping reads with a mean length of 5424.206781043372 for 1 unique viral contigs Generating overhangs in we_3_6_rumen.ovlps.fa fasta file Aligning overhangs to full genome fasta file [M::mm_idx_gen::12.9761.33] collected minimizers [M::mm_idx_gen::15.2681.57] sorted minimizers [M::main::15.2681.57] loaded/built the index for 16225 target sequence(s) [M::mm_mapopt_update::16.7201.52] mid_occ = 45 [M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 16225 [M::mm_idx_stat::17.1631.51] distinct minimizers: 28303869 (49.56% are singletons); average occurrences: 2.108; average spacing: 7.600 [M::worker_pipeline::36.8472.21] mapped 29977 sequences [M::main] Version: 2.17-r941 [M::main] CMD: /home/e1pierce/anaconda3/bin/minimap2 -x map-pb We_unbin_unmap_priority_1000.fasta we_3_6_rumen.ovlps.fa [M::main] Real time: 37.048 sec; CPU: 81.640 sec; Peak RSS: 2.176 GB Found successful associations for 253 out of 691 viral contigs and 49664 ambiguously aligned reads Finished long read overlap! Generating a Hi-C link table from alignment file: We_3_6_hictomega.bam Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Error parsing malformed sam at record: Using a threshold of 218.01372593230863 reads to select Hi-C links to host contigs Found valid Hi-C link associations for 548 viral contigs out of 30370 original candidates. There were an average of 55.41970802919708 host contig associations in this dataset Finished Hi-C link processing! Generating final network table Identified 31 overlapping viral-host associations. Loading taxonomic information from file We_mega_blo.txt Filled in taxonomic entries for 342 associations out of 342 possible entries Final associations in table we_3_6_rumen.final.tab

njdbickhart commented 2 years ago

Thank you for the detailed report! I just took a look and it seems like the problem may be with the BAM file you are using for your Hi-C data. The script skips and then prints out BAM/SAM file lines that have fewer than 7 segments. So this shouldn't impact your final results, but it does highlight some issues with the input BAM file itself that might need a look-over.

The report seems to indicate that the line is empty in the BAM file though. Did you align the Hi-C links with BWA or Minimap2?

ecpierce commented 2 years ago

Thanks for your response @njdbickhart ! I aligned with BWA like this: bwa mem -A1 -B4 -E50 -L0 -t 16 index 1.fq.gz 2.fq.gz > hicaligned.sam

and then used samtools view to make the bam samtools view -b -@ 16 hicaligned.sam > hicaligned.bam

ecpierce commented 2 years ago

Hi @njdbickhart related to the post above- I have some questions about aligning HiC with bwa:

  1. Were you concerned about results being skewed by reads mapping to multiple locations since bwa chooses one of the locations randomly, and did you do anything to deal with this (like filtering out reads that map to multiple locations)?

  2. I think I need to redo my mapping with the -5SP flag, which I didn't use before. But curious what you think about the other options I was using based on the read mapping recommendations here https://hicexplorer.readthedocs.io/en/latest/content/example_usage.html

njdbickhart commented 2 years ago

Sorry for the delay in response! Worries about read mappings to repetitive/orthologous sequence are valid concerns, and this script does not directly filter them. However, I suspect that my stringent read count thresholds for identifying inter-contig links may ultimately obviate this problem.

If it is desired, I can add a filtration flag that removes all Hi-C read alignments that have a MAPQ value less than a user-supplied value. This should eliminate most problematic alignments that could result from ambiguity around repetitive regions. Does that sound reasonable?

On your BWA alignment parameters, are you trying to force soft-clipping of your reads? The -A and -B flags appear to be the defaults, but your gap extension penalty is quite high. Right now, the script does not recognize soft clipping and may count secondary alignments more than once.