t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0

Slamdunk Count: Paired-end slam-seq mapping with ngm #86

Closed penguinmeow closed 5 months ago

penguinmeow commented 3 years ago

Hi Tobias,

Could you kindly help me with the following 2 questions?

  1. The SLAM-seq library was prepared with Lexogen kits and should have used single-end sequencing, but the sequencing center returned paired-end data. Is it a big deal if I just analyze R1.fastq rather than both R1 and R2?
  2. I tried to map the paired-end FASTQ files to the reference genome with NextGenMap 0.5.5, then used slamdunk filter, snp, and count to complete the pipeline. The filter and snp steps run fine, but an error occurs in count:

```
Exited with exit code 1.
Running slamDunk tcount for 1 files (1 threads)
Traceback (most recent call last):
  File "/home/xzhen/.conda/envs/slamseq/bin/slamdunk", line 10, in <module>
    sys.exit(run())
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/slamdunk/slamdunk.py", line 516, in run
    results = Parallel(n_jobs=n, verbose=verbose)(delayed(runCount)(tid, args.bam[tid], args.ref, args.bed, args.maxLength, args.minQual, args.conversionThreshold, outputDirectory, snpDirectory, vcfFile) for tid in range(0, len(args.bam)))
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/parallel.py", line 1048, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/parallel.py", line 866, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/parallel.py", line 784, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/slamdunk/slamdunk.py", line 193, in runCount
    maxLength = estimateMaxReadLength(bam)
  File "/home/xzhen/.conda/envs/slamseq/lib/python3.8/site-packages/slamdunk/utils/misc.py", line 99, in estimateMaxReadLength
    minLength = min(minLength, read.query_length + read.get_tag("XA"))
  File "pysam/libcalignedsegment.pyx", line 2399, in pysam.libcalignedsegment.AlignedSegment.get_tag
  File "pysam/libcalignedsegment.pyx", line 2438, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'XA' not present"
```
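A check like this should show whether the filtered BAM actually carries the XA tag that the length-estimation step above reads (a quick sketch using plain samtools, nothing slamdunk-specific; assumes samtools is on the PATH):

```bash
# count how many of the first 1000 alignments carry an XA tag
# (per the traceback, estimateMaxReadLength reads query_length + the XA tag)
samtools view pipe/1953895_WT_BMDM_1_S110_L004_R1_001_filtered.bam \
    | head -n 1000 | grep -c $'\tXA:i:'
```

A result of 0 would mean none of the inspected reads carry the tag.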

I followed your suggestion in #25

```
ngm -b -r reference.fasta -q fastq-file -t threads --slam-seq 2 -5 12 -l --rg-id readgroup --rg-sm samplename:sampletype:sampletime -n 100 --strata -o out.bam
```

And ran the whole pipeline like this:

```bash
#!/bin/bash
module load conda3/5.1.0
source activate slamseq

# for 1953895

# map
ngm -b -r GRCm38.p6.genome.fa -1 pipe/1953895_WT_BMDM_1_S110_L004_R1_001.fastq.gz -2 pipe/1953895_WT_BMDM_1_S110_L004_R2_001.fastq.gz -t 16 --slam-seq 2 -5 12 -l --rg-id 1953895 --rg-sm 1953895_WT_BMDM_201120 -n 100 --strata -o pipe/1953895_WT_BMDM_1_S110_L004_R1_001.bam

# filter
slamdunk filter -o pipe -b UTR/3pUTR.bed -t 1 pipe/1953895_WT_BMDM_1_S110_L004_R1_001.bam

# snp
slamdunk snp -o pipe -r GRCm38.p6.genome.fa -t 16 pipe/1953895_WT_BMDM_1_S110_L004_R1_001_filtered.bam

# count
slamdunk count -o pipe -s pipe -v 1953895_WT_BMDM_1_S110_L004_R1_001_filtered_snp.vcf -r GRCm38.p6.genome.fa -b UTR/3pUTR.bed -t 1 pipe/1953895_WT_BMDM_1_S110_L004_R1_001_filtered.bam
```

But this error occurs:

```
The output (if any) follows
[MAIN] NextGenMap 0.5.5
[MAIN] Startup : x64 (build Jul 3 2020 02:47:43)
[MAIN] Starting time: 2020-11-20.15:00:39
[CONFIG] Parameter: --affine 0 --argos_min_score 0 --bam 1 --bin_size 2 --block_multiplier 2 --broken_pairs 0 --bs_cutoff 6 --bs_mapping 0 --cpu_threads 16 --dualstrand 1 --fast 0 --fast_pairing 0 --force_rlength_check 0 --format 2 --gap_extend_penalty 5 --gap_read_penalty 20 --gap_ref_penalty 20 --hard_clip 0 --keep_tags 0 --kmer 13 --kmer_min 0 --kmer_skip 2 --local 1 --match_bonus 10 --match_bonus_tc 2 --match_bonus_tt 10 --max_cmrs 2147483647 --max_equal 1 --max_insert_size 1000 --max_polya -1 --max_read_length 0 --min_identity 0.650000 --min_insert_size 0 --min_mq 0 --min_residues 0.500000 --min_score 0.000000 --mismatch_penalty 15 --mode 0 --no_progress 0 --no_unal 0 --ocl_threads 1 --output pipe/1953895_WT_BMDM_1_S110_L004_R1_001.bam --overwrite 1 --pair_score_cutoff 0.900000 --paired 1 --parse_all 1 --pe_delimiter / --qry1 pipe/1953895_WT_BMDM_1_S110_L004_R1_001.fastq.gz --qry2 pipe/1953895_WT_BMDM_1_S110_L004_R2_001.fastq.gz --qry_count -1 --qry_start 0 --ref GRCm38.p6.genome.fa --ref_mode -1 --rg_id 1953895 --rg_sm 1953895_WT_BMDM_201120 --sensitive 0 --silent_clip 0 --skip_mate_check 0 --skip_save 0 --slam_seq 2 --step_count 4 --strata 1 --topn 100 --trim5 12 --update_check 0 --very_fast 0 --very_sensitive 0
[NGM] Opening for output (BAM): pipe/1953895_WT_BMDM_1_S110_L004_R1_001.bam
[SEQPROV] Reading encoded reference from GRCm38.p6.genome.fa-enc.2.ngm
[SEQPROV] Reading 2755 Mbp from disk took 1.15s
[PREPROCESS] Reading RefTable from GRCm38.p6.genome.fa-ht-13-2.3.ngm
[PREPROCESS] Reading from disk took 5.00s
[PREPROCESS] Max. k-mer frequency set so 1014!
[INPUT] Input is paired end data.
[INPUT] Opening file pipe/1953895_WT_BMDM_1_S110_L004_R1_001.fastq.gz for reading
[INPUT] Opening file pipe/1953895_WT_BMDM_1_S110_L004_R2_001.fastq.gz for reading
[INPUT] Input is Fastq
[INPUT] Estimating parameter from data
[INPUT] Average read length: 89 (min: 89, max: 90)
[INPUT] Corridor width: 18
[INPUT] Average kmer hits pro read: 15.960559
[INPUT] Max possible kmer hit: 25
[INPUT] Estimated sensitivity: 0.638422
[INPUT] Estimating parameter took 37.089s
[INPUT] Input is Fastq
[INPUT] Input is Fastq
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[OPENCL] Available platforms: 1
[OPENCL] AMD Accelerated Parallel Processing
[OPENCL] Selecting OpenCl platform: AMD Accelerated Parallel Processing
[OPENCL] Platform: OpenCL 1.2 AMD-APP (1214.3)
[OPENCL] 1 CPU device found.
[OPENCL] Device 0: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (Driver: 1214.3 (sse2,avx))
[OPENCL] 28 CPU cores available.
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[Progress] Mapped: 1, CMR/R: 0, CS: 0 (0), R/S: 0, Time: 0.00 0.00 0.00, Pairs: 0.00 0.00
[OUTPUT] TopN > 1 is currently not supported for paired end reads. This error is fatal. Quitting...
```

It seems we cannot run ngm with the same parameter set for paired-end FASTQs. Is there another way to fix the `tag 'XA' not present` error in count?

Thank you so much! And happy weekend~

Regards, Zhen

LasseLehtonenDK commented 3 years ago

Hi Zhen,

I ran into the same error as you when trying the paired-end workaround you mentioned. The problem seems to be the max read-length estimation. By specifying the max read length explicitly (the `-l` parameter) in the count dunk, I was able to get past this issue. The same is true for all Alleyoop modules that require max read-length estimation (utrrates, snpeval, tcperreadpos, tcperutrpos). Maybe this will help you as well - see the sketch below.
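For example (a sketch with placeholder file names; `-l 90` assumes 90 bp reads, matching the max read length ngm reported above - adjust to your data):

```bash
# supply the read length explicitly so the XA-based estimation is skipped
slamdunk count -o pipe -s pipe -v pipe/sample_filtered_snp.vcf \
    -r GRCm38.p6.genome.fa -b UTR/3pUTR.bed -l 90 -t 1 \
    pipe/sample_filtered.bam

# the same -l workaround for an Alleyoop module that estimates read length
alleyoop utrrates -o pipe -r GRCm38.p6.genome.fa -b UTR/3pUTR.bed -l 90 \
    pipe/sample_filtered.bam
```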

Best regards, Lasse

t-neumann commented 3 years ago

Hi @penguinmeow

sorry for the slow response. I think you can first try it with paired-end alignments. For this I guess you need to get rid of the -l and -n 100 parameters and try again - something along the lines of the sketch below.
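Untested sketch, based on your own command above with -l and -n 100 dropped:

```bash
# paired-end mapping without -l and -n 100
ngm -b -r GRCm38.p6.genome.fa \
    -1 pipe/1953895_WT_BMDM_1_S110_L004_R1_001.fastq.gz \
    -2 pipe/1953895_WT_BMDM_1_S110_L004_R2_001.fastq.gz \
    -t 16 --slam-seq 2 -5 12 --strata \
    --rg-id 1953895 --rg-sm 1953895_WT_BMDM_201120 \
    -o pipe/1953895_WT_BMDM_1_S110_L004_R1_001.bam
```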

If you still get errors afterwards, I would then go for merging both read sets and consider them as single-end reads.

The `-l` parameter for the slamdunk modules is only needed once you have a properly aligned BAM, but from what I can see you already failed at the mapping stage, right?

penguinmeow commented 3 years ago

> Hi @penguinmeow
>
> sorry for the slow response. I think you can first try it with paired-end alignments. For this I guess you need to get rid of the -l and -n 100 parameters and try again.
>
> If you still get errors afterwards, I would then go for merging both read sets and consider them as single-end reads.
>
> The -l parameter for the slamdunk modules you only need once you have a properly aligned bam but from what I can see you already failed at the mapping stage right?

Hi Tobias, thanks for your suggestion and sorry for the late response. Yes, previously I mapped R1 and R2 to the reference genome with the paired-end mode of NextGenMap and used the slamdunk filter, snp, and count modules for further analysis. That turned out to hit the `KeyError: "tag 'XA' not present"` in the count module. To get rid of the error, I followed the instructions in #25 and added -l and -n 100 to the ngm parameters, which then gave the error "TopN > 1 is currently not supported for paired end reads."

I removed the -l and -n 100 from the ngm parameters and re-ran the whole slamdunk pipeline today, but the same `KeyError: "tag 'XA' not present"` occurred again in the count module. The script and error log are attached in case you want to take a look: pipe_script.txt pipeline_err.txt

Thank you so much and wish you a happy Christmas~ Zhen

t-neumann commented 3 years ago

Hi Zhen,

sorry for not being clear - the `-l` should only be removed from the ngm call. For the actual slamdunk commands you should supply the read length with `-l readlength`, to avoid the length-estimation step which uses the absent XA tag. Do you follow me? See the sketch below.
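Concretely (a sketch using your file names; `-l 90` assumes your 90 bp reads, per the ngm log above):

```bash
# keep the ngm call free of -l / -n 100 (as sketched earlier), then:
# pass the read length so estimateMaxReadLength (and its XA lookup) is skipped
slamdunk count -o pipe -s pipe \
    -v pipe/1953895_WT_BMDM_1_S110_L004_R1_001_filtered_snp.vcf \
    -r GRCm38.p6.genome.fa -b UTR/3pUTR.bed -l 90 -t 1 \
    pipe/1953895_WT_BMDM_1_S110_L004_R1_001_filtered.bam
```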

Happy xmas to you too!

drowsygoat commented 3 years ago

This is kind of related, so I decided to continue in this thread; it can be moved to a new thread if deemed off-topic.

In paired-end data there is the problem of potential T->C conversions within the overlap of a read pair: such hits would be counted twice, even though they should not be. One way to overcome the problem is to find such mates and set the base qualities of one mate in the BAM file to zero (or below the threshold used for counting, so that they are not counted) - the trick used for bisulfite sequencing (see the sketch below).
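For illustration, bamUtil's clipOverlap implements essentially this trick from bisulfite pipelines by soft-clipping the overlapping portion of one mate (a sketch; it assumes bamUtil is installed and the BAM is coordinate-sorted, and I have not verified how slamdunk counting behaves on the clipped output):

```bash
# soft-clip the overlap of one mate so overlap bases are not double-counted
bam clipOverlap --in sample_filtered.bam --out sample_filtered_clipped.bam --stats
```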

I am also toying with another, slightly less elegant approach: merging overlapping pairs to obtain a single long read for each pair, and then doing the analysis as if it were SE data (see the sketch below). The problem is that the merged reads will have different lengths, so theoretically the counts should be weighted to take these differences into account (a longer read is more likely to be converted; this can be nuanced even further, as the conversion probability depends on the T content of a read).
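The merging step itself could look like this (a sketch assuming BBTools' bbmerge.sh with placeholder file names; any read merger would do):

```bash
# merge overlapping mates into single pseudo-single-end reads;
# unmerged pairs are written to separate files
bbmerge.sh in1=R1.fastq.gz in2=R2.fastq.gz \
    out=merged.fastq.gz outu1=unmerged_R1.fastq.gz outu2=unmerged_R2.fastq.gz
```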

But leaving the nuances aside, I would love to hear from you about how to tackle the problem of overlapping pairs.

@t-neumann Does slamdunk provide a way to tag converted reads in a BAM file? Having such a read tag would be another way to tackle the described problem, as you could manually set the qualities of either mate of a converted pair to zero. This would effectively amount to counting converted fragments, which is not perfect (different lengths, etc.), but might be good enough.

I am really curious what you think! Also, if I wrote anything stupid, don't hesitate to correct me! :)