timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License
118 stars 25 forks source link

filter barcodes parallel issue #17

Closed Pentayouth closed 2 years ago

Pentayouth commented 4 years ago

similar but different from issue #15 this is my original single cell bam file

@RG     ID:CH4-LN_2_L001        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L002        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L003        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L004        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L005        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L006        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L007        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@PG     ID:STAR PN:STAR VN:2.7.4a       CL:STAR   --runThreadN 8   --genomeDir /data/wangzw/dropseqMetadata_b37/STAR   --readFilesIn unaligned_mc_tagged_polyA_filtered.fastq      --outFileNamePrefix star   --outReadsUnmapped Fastx   --twopassMode Basic
@PG     ID:0    PN:TagReadWithGeneFunction      CL:TagReadWithGeneFunction INPUT=merged.bam OUTPUT=star_gene_exon_tagged.bam ANNOTATIONS_FILE=/data/wangzw/dropseqMetadata_b37/b37.refFlat    GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false  VN:2.3.0(34e6572_1555443285)
ST-E00522:612:H2WHKCCX2:7:1220:14093:56985      16      1       10166   0       49M44N32M       *       0       0       CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAAGCCCTAACCCTAACCCTAACCCTAACCCTAACCC       JJJJJJJJFJJFJJJFAJJJJAJJJJJJ7JJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAJJJJJFJJJJFFFAA       XC:Z:CCTTGTCGACTC
       MD:Z:47C33      XF:Z:INTERGENIC PG:Z:STAR       RG:Z:CH4-LN_2_L003      NH:i:5  NM:i:1  XM:Z:TTGGCCTC   ZP:i:82 UQ:i:41 AS:i:79
ST-E00522:612:H2WHKCCX2:7:2216:25540:11839      16      1       11743   1       150M    *       0       0       TGACGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCT  <7<-A-JJA7FFAAA))7)7A-7AFF7JA<F7JJ<JJJJFFA-<AJJJFF
JJFFF--777JA<FA-7JF7AJJFJJJJJJJFJJFJJ<J7J<JJJJFJFJF<<FJJFJ<JJJJJJJJFJFFFJJJJJFA7-JJAFF<JJJFFJJJAAAAA  XC:Z:GAGACGAGGCCC       MD:Z:3T146      XF:Z:INTERGENIC PG:Z:STAR       RG:Z:CH4-LN_2_L003      NH:i:3  NM:i:1  XM:Z:ACTGGCCA   UQ:i:12 AS:i:146        gf:Z:CODING,INTERGENIC  gn:Z:DDX11L1,DDX11L1    gs:Z:+,+
ST-E00522:612:H2WHKCCX2:6:1124:17797:16041      16      1       11762   1       150M    *       0       0       CCGGTGTTGAAAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTT  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA  XC:Z:CTTCTTCCGCTT       MD:Z:10G139     XF:Z:INTERGENIC PG:Z:STAR       RG:Z:CH4-LN_2_L007      NH:i:3  NM:i:1  XM:Z:ATGGAGGA   UQ:i:41 AS:i:146        gf:Z:CODING,INTERGENIC  gn:Z:DDX11L1,DDX11L1    gs:Z:+,+

which has 7 RG from CH4-LN_2_L001 to CH4-LN_2_L007

I want to extract reads tagged by 100 cell barcodes, and here is my code: sinto filterbarcodes -b star_gene_exon_tagged.bam -c xcForTest.txt --barcodetag XC -p 16

xcForTest.txt has 100 cell barcode,like this:

ACCGTCAGCGAT    subset
GTTCAGAATAGC    subset
GCAACACGAGTG    subset
GCTTCACCCTTA    subset
TCGATCCACGAG    subset
CACGCCAATTAG    subset
CGACCGGGAAAA    subset
CAAGCATATGCA    subset
CTCATGTTGTAG    subset
TCCTCCGACCCA    subset
...

when i was checking the output bam, which is subset.bam, using code: samtools view -h ./subset.bam | grep "CH4-LN_2_L007-" | less I found some unexpected records like:

ST-E00522:612:H2WHKCCX2:6:2112:30259:10521      16      1       197113101       255     25M     *       0       0       GCTCTTCTGCATTTCCTAGTAATAT       JJJJJJJJJJJJJJJJJJJJFFFAA       XC:Z:CTAACAAGTTCT       MD:Z:25 XF:Z:CODING     NH:i:1  NM:i:0  XM:Z:TGAGCGGG   ZP:i:26 UQ:i:0  AS:i:24 gf:Z:CODING     gn:Z:ASPM
       gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB     PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2103:22922:9027       16      1       197113183       255     48M2040N102M    *       0       0       TTCTTTAATTACTCTCCACTTAACAGAAATAACAATTTTCTCTTTAGGCTGCAACACGAAACAGCGCTGCGACACACTGAAGCCCAGGTCCGCGGCCGGGAAGTGGGAGATCTTCACTTCTGCCACCTCCTCGTTAGGGTTGTCTAGGGC  <FF7---7-7<A<FJFF<-JJJJJAJFFAFFAFJFJAF7JJFFA7-7-AFJJFJAJAJJJ<JFJJFFAJJFFA<FJJFFA7AA77--AAA7A-7AJ-JFJJA7-<AJJJJJJJAJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJFFFAA  XC:Z:CTAACAAGTTCT       MD:Z:6G1G1G0G1G4G131    XF:Z:CODING     NH:i:1  NM:i:6  XM:Z:TGAGCGGG   UQ:i:132        AS:i:137        gf:Z:CODING     gn:Z:ASPM       gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB
     PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2208:13829:28611      16      1       197122849       255     150M    *       0       0       AAGTAAAACAAAGAACTAGTTCAATATACAGTACACTTCCTACTCTTCACAGAGAACTGAAATTTTCTATAAAGACATTTATACTTAGGAAACATCAGACAACCAAAGTATGTATAAAACTCACAAGATATTTTACACACAGTTCACAAT  AA--A<-F<F<<7<F7-77<<JFF<F7<-A<FFF77-FFF<-AFAAFJJJJJJFJJFJJJAAFJFJJFJFJJJJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA  XC:Z:GTGATGTCGGCT       MD:Z:6C143      XF:Z:UTR        NH:i:1  NM:i:1  XM:Z:GCGTGAGG   UQ:i:12 AS:i:146        gf:Z:UTR        gn:Z:ZBTB41     gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB     PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2105:16122:62505      16      1       197168944       255     150M    *       0       0       CTTTTTATAACAAAAATGTCTACTACAGAATTTGCACTGATGATTATTTGATAGTCTTCCAGTTAATTCATTTAGTGTTTCTTCTGGTGATGACTTTTCACTTAGCTCTGAATGAAAAGGGGCAACATTTTCGTTATTTAACAACTTCAC  FA<-<7JJJJJJJJJJJJFAJJJJFAJJJJJFJAAFJJJJJJJJFFJJJJJJJJJJFJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJAJFJJJJAF7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFAFAA  XC:Z:GAGTCGAGCGAA       MD:Z:100G49     XF:Z:CODING     NH:i:1  NM:i:1  XM:Z:AGGGGCGG   UQ:i:37 AS:i:146        gf:Z:CODING     gn:Z:ZBTB41     gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB     PG:Z:STAR-7303B068

'-364D2CCB' were added to the cell barcode tag, which influence my downstream process. Is it a bug?

timoast commented 4 years ago

I can confirm this is a bug causing extra characters to be added to the read group entries (not cell barcode tag). A workaround is to use only one core -p 1

ghuls commented 3 years ago

Recent versions of samtools support extracting all reads with have a certain list of barcodes.

samtools view -@ 4 -b -o subset.bam -D XC:subset_barcodes.txt input.bam

With subset_barcodes.txt

ACCGTCAGCGAT
GTTCAGAATAGC
GCAACACGAGTG
GCTTCACCCTTA
TCGATCCACGAG
CACGCCAATTAG
CGACCGGGAAAA
CAAGCATATGCA
CTCATGTTGTAG
TCCTCCGACCCA    
...
Pentayouth commented 3 years ago

Thank you so much 👍

timoast commented 2 years ago

This should now be fixed in the latest release