timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License

Fragments file has size of zero #32

Status: Closed (prmunn closed this issue 3 years ago)

prmunn commented 3 years ago

Hi - I'm new to sinto and I'm unable to produce a fragments file. My cell barcodes are in the read names of my BAM file, between the first and second underscore, and my chromosomes have numeric names, with no "chr" at the beginning. So, I need to know what the regex pattern would look like for both the --barcode_regex and --use_chrom options, and I need to know how to stop --barcodetag from using its default of "BC". Here are the first 10 rows of my BAM file:

E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT 163 10 3100324 0 87M = 3100324 87 CATTTACACAATGGAATACTACTCAGCTATTAAAAAATGAATTTATGAAATTCCTAGGCAAATGGATGGACCTGGAGGGTATCATCC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   NM:i:0  MD:Z:87 MC:Z:87M AS:i:87 XS:i:87 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT 83 10 3100324 0 87M = 3100324 -87 CATTTACACAATGGAATACTACTCAGCTATTAAAAAATGAATTTATGAAATTCCTAGGCAAATGGATGGACCTGGAGGGTATCATCC JJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA   NM:i:0  MD:Z:87 MC:Z:87M AS:i:87 XS:i:87 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2206:16802:5212_TGAAGAAACCGTTTGTTTACACAACA_GCNATCCATC 99 10 3100767 0 62M = 3100767 62 ATGCCGGGGCCTAGCAAACACAGAAGTGGATGATCACAGTCAGCTATTGGATGGGTCACACG   AAFFFJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJ   NM:i:0  MD:Z:62 MC:Z:43S62M     AS:i:62 XS:i:62   RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2206:16802:5212_TGAAGAAACCGTTTGTTTACACAACA_GCNATCCATC 147 10 3100767 0 43S62M = 3100767 -62 AGTTTCTTCATCGTCGGCAGCGTCAGATGTGTATGAGATACAGATGCCGGGGCCTAGCAGACACAGAATTGGATGATCACAGTCAGCTATTGGATGGGTCACACG    <FFFJ<-JFFFJJJJJJF77F-JJJF<-J-JAJF-JF7AA-7F-JJJJJ7JF-FAA7-<-F<-<A-<F-J7JJJJJJJ7FJF-F7JJJJJJJF-JAAJAFF-JF- NM:i:2 MD:Z:16A8G36 MC:Z:62M AS:i:52 XS:i:57 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2118:27225:71717_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCGC 163 10 3104258 60 117M = 3104509 401 AGTGTGTAGCTTATTAGTGGGGTGTTTGGCAGCATACATGAGGTTTTAGATTAAATCCCCCTGTTACAAAATAAGTAAAAGAGCATATCAGACACACCCCCCCATAGGAAAGAACAA        JJ7FJJFJJJJJJJJJJFF<FFJJJFJJJJJJJJJJJFFJJJJAJFJJJJJJJJJJFAAJF7AF-<AJJJJFAJJJJJJJJJJ-AF<FJJJJJFF<<-7777A-7FF7-AJJJJ-AA NM:i:1 MD:Z:95C21 MC:Z:150M AS:i:112 XS:i:93 RG:Z:BPA1 XA:Z:10,-7456812,20M2I92M3S,5;10,+22240801,94M2D23M,7;10,+22431027,94M3D23M,8;
E00558:642:HFL3TCCX2:8:2120:9628:66127_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCNC 163 10 3104273 0 15S68M33S = 3104509 386 AGTGTGTAGGTGATGAGTGGGGGGTTTGTCAGAATACATGAGGATTTAGATGAAATCACCCGGATACAAAAGAAGTAAAAGAGAATAAAAGACGGCACAGAGCATATAATAAAACA      AA<-FFJA7-7-FAA-----FJ---7-<--A<-77F<<--A-----7FA-<7-<77-<-A-----7-<7---7----7--A-A<------<----77--A--7--7-77-<-7-7< NM:i:9 MD:Z:7T5G3C10T7T5C3T1T7T11 MC:Z:150M AS:i:23 XS:i:23 RG:Z:BPA1        XA:Z:8,-16209092,29S21M66S,0;1,+129147041,65S20M31S,0;14,-14404651,24S19M73S,0;10,-113473301,33S19M64S,0;15,+60550338,70S19M27S,0;
E00558:642:HFL3TCCX2:8:2118:27225:71717_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCGC 83 10 3104509 60 150M = 3104258 -401 TCAGTAGGCAGACAGGAATAACCAAGGCCAGAAGATAATCTCTTTCCAATGGGCATAGAACCCTTCACTCTGCAGGCTGAGATGTGTTGCCATTATGAAGGAGATAAAAGTTTCAGGGGATCTTGTGTTGTTAGCCTCAATGGAAAGAAC       FJJJFJFJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJAFFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA    NM:i:0  MD:Z:150        MC:Z:117M       AS:i:150        XS:i:102        RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2120:9628:66127_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCNC 83 10 3104509 60 150M = 3104273 -386 TCAGTAGGCAGACAGGAATAACCAAGGCCAGAAGATAATCTCTTTCCAATGGTCATAGAACCCTTCACTCTGCAGGCTGAGATGTGTTGCCATTATGAAGGAGATAAAAGTTTCAGGGGATCTTGTGTTGTTAGCCTCAATGGAAAGAAC        A7JJFJAF<-F7-F<JJJJJF7-A777--FJJJFJJAA7FFJFFFJJJJFA7-JFJJJFAA-7AAF-FJJJJFJFFJJJAJFJJJJF<FJJJJJJFJJFAJJAJJJJJJJJJJJJJFJJJ<JJJJJJJFFJJJJJA-FJJJJJJFFFFAA    NM:i:1  MD:Z:52G97      MC:Z:15S68M33S  AS:i:145        XS:i:97 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2201:25083:16006_TGAAGAAAGCAGTAGAGCACTTGGCG_TATGCNTTAC 99 10 3104650 60 150M = 3105087 553 GGAAAGAACATGTTCATGTTGACACAAGCACTGGCAACTGGACTCAATTGGATCCTAGATTGAAGAAGAGTATAGAAATAGGGAAGGAAGACAGGACTCGATCTTCCTTCTTAGAGAAGACTACAGAGGGTGACTGCAAGACCTGGCGTG        AAFFFJJJFJJJFJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJAJJJJFJ<FJFJFFJFJJJAFAFJ7JFFAJJ7F<7FAJJFJAAJJJJJJJJJJJJFFJFJAJJJ<AAJJFJAJJJJJJJJJFAAAAJ7FFA<AFJJF7FJ77<--<    NM:i:1  MD:Z:147T2      MC:Z:116M       AS:i:147        XS:i:95 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2201:25083:16006_TGAAGAAAGCAGTAGAGCACTTGGCG_TATGCNTTAC 147 10 3105087 60 116M = 3104650 -553 GTGCGGAAGAGGAGGCACACAACATGTAAGAACCAGAGGGGATTGAGGACACCAAGGATTTCTCCTCTTAAGTCAACACGATCCACACACATATGAACTCACAGGTACTGGAGTAG        7FF7JJJFJFAF7-F-7F-AFFJJJJAAFFJFJJJA<JFAFFJJJJJ<FJJJJJJJJJJFA-A-FFFJJJFAFFFJJJJJJJJJJFFJJJJJJJFF7JJJJJJJJJJAAFJJJJJJ NM:i:0 MD:Z:116 MC:Z:150M AS:i:116 XS:i:103 RG:Z:BPA1 XA:Z:10,+7455967,116M,3;10,+3205176,112M4S,4;
timoast commented 3 years ago

Hi @prmunn, to extract cell barcodes from the read name you can provide a regular expression. In your case something like --barcode_regex "(?<=_)(.*)(?=_)" should work. When looking into this, I realized that it's currently difficult to match strings that don't start at the beginning of the read name, so I have made a change that should make that easier (you'll need to install from GitHub).
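
To check the suggested pattern against one of the read names posted above, a quick sketch with Python's `re` module can help. It assumes the pattern is applied as a search over the whole read name; how sinto applies it internally is not shown in this thread, and `extract_barcode` is a hypothetical helper, not part of sinto.

```python
import re

# Pattern suggested above: text preceded by "_" and followed by "_".
BARCODE_REGEX = re.compile(r"(?<=_)(.*)(?=_)")

# Read name copied from the first record in the question.
READ_NAME = "E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT"


def extract_barcode(read_name, pattern=BARCODE_REGEX):
    """Hypothetical helper: return the first match of pattern, or None."""
    hit = pattern.search(read_name)
    return hit.group(0) if hit else None


barcode = extract_barcode(READ_NAME)
print(barcode)

# re.match only tries position 0, where the lookbehind cannot succeed,
# so a start-anchored match finds nothing for this read name.
print(BARCODE_REGEX.match(READ_NAME))
```

Because `.*` is greedy, a read name with more than two underscores would yield everything between the first and last underscore. A stricter pattern such as `(?<=_)([^_]+)(?=_)` would return only the first underscore-delimited field in that case.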

If you set the --barcode_regex parameter, the --barcodetag parameter is not used (see https://timoast.github.io/sinto/basic_usage.html#create-scatac-seq-fragments-file). You can set --use_chrom "" to match all chromosomes, since an empty pattern matches any chromosome name.
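
As a rough illustration of why an empty --use_chrom pattern keeps numerically named chromosomes, the sketch below tests a few chromosome names against two patterns with Python's `re` module. The `(?i)^chr` pattern is an assumption standing in for a default that expects a "chr" prefix, `chrom_selected` is a hypothetical helper, and the file names in the printed command are placeholders.

```python
import re
import shlex

ASSUMED_DEFAULT = r"(?i)^chr"  # assumption: a pattern requiring a "chr" prefix
MATCH_ALL = ""                 # empty pattern, as suggested above


def chrom_selected(pattern, chrom):
    """Hypothetical helper: True if the chromosome name matches the pattern."""
    return re.search(pattern, chrom) is not None


# Compare how each pattern treats numeric and "chr"-prefixed names.
for chrom in ["10", "chr10", "MT"]:
    print(chrom, chrom_selected(ASSUMED_DEFAULT, chrom), chrom_selected(MATCH_ALL, chrom))

# Command combining both suggestions; input.bam and fragments.tsv are
# placeholders, and the options should be checked against `sinto fragments --help`.
command = ["sinto", "fragments", "-b", "input.bam", "-f", "fragments.tsv",
           "--barcode_regex", "(?<=_)(.*)(?=_)", "--use_chrom", ""]
print(shlex.join(command))
```

With chromosome names like "10", a prefix pattern of this kind selects nothing, which would be consistent with the empty fragments file described in the title.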

prmunn commented 3 years ago

Yep - that worked. Many thanks!