t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0
37 stars 22 forks source link

KeyError: "tag 'XI' not present". what does identity value mean? #136

Open zhouyjjj opened 11 months ago

zhouyjjj commented 11 months ago

Hi Tobias,

I am trying to use STAR mapped file as input of slamdunk, as I have some junction reads that ngm seems unable to handle properly.

And the filter didn't work correctly, the output looks like this:

Creating output directory: ./f10sd Running slamDunk filter for 1 files (12 threads) joblib.externals.loky.process_executor._RemoteTraceback: """ Traceback (most recent call last): File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker r = call_item() File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in call return self.fn(*self.args, self.kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in call return self.func(*args, *kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 252, in call return [func(args, kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 252, in return [func(*args, **kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/slamdunk.py", line 170, in runFilter filter.Filter(bam, outputBAM, getLogFile(outputLOG), bed, mq, minIdentity, maxNM, printOnly, verbose) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/dunks/filter.py", line 270, in Filter mappedReads, unmappedReads, filteredReads, mqFiltered, idFiltered, nmFiltered, multimapper = multimapUTRRetainment (infile, outfile, bed, minIdentity, NM, log) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/dunks/filter.py", line 107, in multimapUTRRetainment if(float(read.get_tag("XI")) < minIdentity): File "pysam/libcalignedsegment.pyx", line 2507, in pysam.libcalignedsegment.AlignedSegment.get_tag File "pysam/libcalignedsegment.pyx", line 2546, in pysam.libcalignedsegment.AlignedSegment.get_tag KeyError: "tag 'XI' not present" """

So I checked the filter.py finding that it asked for "XI" and "NM" tag from sam file which STAR aligner does not output. I think the "NM" tag can be substitute with the "nM" tag in STAR map, but I am not sure what can be used to replace "XI" tag. Also, I am curious why "minIdentity" is set to 0.8 in filter.py. Can you give me some hint on how to get a identity value from STAR aligner.

Thank you!

zhouyjjj commented 11 months ago

Hi Tobias,

I also met a problem that says no RG:

Running slamDunk tcount for 1 files (12 threads) joblib.externals.loky.process_executor._RemoteTraceback: """ Traceback (most recent call last): File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker r = call_item() File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in call return self.fn(*self.args, self.kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in call return self.func(*args, *kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 252, in call return [func(args, kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 252, in return [func(*args, *kwargs) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/slamdunk.py", line 202, in runCount tcounter.computeTconversions(ref, bed, inputSNP, bam, maxLength, minQual, outputCSV, outputBedgraphPlus, outputBedgraphMinus, conversionThreshold, log) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/dunks/tcounter.py", line 129, in computeTconversions sampleInfo = getSampleInfo(bam) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/utils/misc.py", line 239, in getSampleInfo sampleInfo = getReadGroup(bam) File "/public/slst/home/lict/anaconda3/lib/python3.8/site-packages/slamdunk/utils/misc.py", line 236, in getReadGroup raise RuntimeError("Could not get mapped/unmapped/filtered read counts from BAM file. RG is missing. Please rerun slamdunk filter.") RuntimeError: Could not get mapped/unmapped/filtered read counts from BAM file. RG is missing. Please rerun slamdunk filter. """ But I don't see a RG header in "Slamdunk all" generated filtered sam, this is how the top of this file looks: A01890:220:H2CNYDSX7:4:1545:9498:5916 0 chr1 13038 2 138M 0 0 TCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:1380 NM:i:0 NH:i:1 XI:f:1 XA:i:0 X0:i:1XE:i:40 XR:i:138 MD:Z:138 RG:Z:0 TC:i:0 RA:Z:33,0,0,0,0,0,31,0,0,0,0,0,51,0,0,0,0,0,23,0,0,0,0,0,0, A01890:220:H2CNYDSX7:4:1545:9507:5932 0 chr1 13038 2 138M * 0 0 TCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:1380 NM:i:0 NH:i:1 XI:f:1 XA:i:0 X0:i:1XE:i:40 XR:i:138 MD:Z:138 RG:Z:0 TC:i:0 RA:Z:33,0,0,0,0,0,31,0,0,0,0,0,51,0,0,0,0,0,23,0,0,0,0,0,0,

And this is what i generated filtered bam from STAR mapped files: A01890:220:H2CNYDSX7:4:1545:9498:5916 0 chr1 13026 255 150M 0 0 GCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:148 nM:i:0 A01890:220:H2CNYDSX7:4:1545:9507:5932 0 chr1 13026 255 150M 0 0 GCATGAAGGCTGTCAACCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGTTGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:148 nM:i:0

The RG is not in the header of sam file, is it the one behind the tags? And i also rerun filter.py, it does not solve this problem.

thanks.

t-neumann commented 11 months ago

Hi @zhouyjjj - sorry for the long radio silence, I was on holiday.

Currently Slamdunk really only supports NGM alignments due to the said requirement of certain BAM tags to be present.

The identity value basically just states how much of the original read sequence was aligned in some way or another.

Can you send me the actual command where you call slamdunk all so I can have a look?

zhouyjjj commented 11 months ago

Hi Tobias,

This is how my command looks like:

slamdunk all -r /public/slst/home/lict/ZYJ/referenceseq/GRCh38.p14.genome.fa -b /public/slst/home/lict/ZYJ/simu/genlist.bed -5 8 -n 100 -m -t 12 -o ./ --skip-sam /public/slst/home/lict/ZYJ/result/fastqc/I2_L1_1_val_1.fq /public/slst/home/lict/ZYJ/result/fastqc/I2_L1_2_val_2.fq

The problem is ngm works extremely slow with my actual seq data, which is around 18g per file, and mapping rate is significantly lower than using STAR. So I am still trying to use STAR mapped sam as input. Currently I find replacement for XI ,RG, NM tags, the only problem is MPtag now.

t-neumann commented 11 months ago

What you could try is to crank the number of threads up even more - maybe to 30 and see if that helps.

To help with the tags, could you instead of supplying the files via commandline, type up a sample sheet in the appropriate formate and run it with this?

https://t-neumann.github.io/slamdunk/docs.html#sample-file