Open Simon-Coetzee opened 7 years ago
@ST-K00126:307:HFM3NBBXX:1:1101:3772:1244 2:N:0:NTCGCCCT
NAAGCCAGTTGTGAATCATGCACATCAGCTCCTTCTGAAATGTGTTTATGGCCTAGGACACAGGGACCCTGGAGACTATGGTGCTGCAGTGCATTATG
+
#<<A<FJJJFJFJJJJJJJJJFJFJJJJJJJJJJJJJJJFJJFFJJJJJAFJJFJF7JJJJFJAJJJ<J<7-A<FFFFJ-F<FJJJJJJJJJJ7FJJA
This is what I meant for R2.
@Simon-Coetzee, this is a correct description of the 10X V2 chemistry. I believe concatenating the sample and cellular barcodes, however, is incorrect (merge_barcodefiles_10x(), args['barcode_start'] = 0, args['barcode_end'] = 26).
This is because 10X uses four 8 bp oligonucleotides per sample index to address sequencing biases. This can be easily observed with any sample barcode file:
zcat SAMPLE_I1_001.fastq.gz | awk '{if(NR % 4 == 2) {a[$1] += 1}} END {for(x in a) {print x "\t" a[x]}}' | sort -k2,2gr
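For anyone who prefers to run the same check from Python, here is a rough equivalent of the one-liner above. It is only a sketch: the function names are made up for illustration, and it assumes a standard four-line-per-record gzipped FASTQ file.

```python
import gzip
from collections import Counter


def count_index_sequences(lines):
    """Count the sequence line (2nd line of every 4-line FASTQ record)."""
    counts = Counter()
    for line_number, line in enumerate(lines, start=1):
        if line_number % 4 == 2:  # same test as NR % 4 == 2 in awk
            counts[line.strip()] += 1
    return counts


def count_index_file(path):
    """Open a gzipped FASTQ file in text mode and count its sequences."""
    with gzip.open(path, "rt") as handle:
        return count_index_sequences(handle)


# Usage, with the file name from the command above:
# for seq, n in count_index_file("SAMPLE_I1_001.fastq.gz").most_common():
#     print(f"{seq}\t{n}")
```

`most_common()` returns the sequences sorted by descending count, which plays the role of `sort -k2,2gr`.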
Concatenating the sample and cellular barcodes will (I think) result in reads for a given cell being associated with four different barcodes. Using just the 16 bp cellular barcode should avoid these issues.
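To make the concern concrete, here is a toy illustration. All sequences are invented for the example (they are not real 10X sample index or whitelist sequences), and the merge is modeled simply as string concatenation, which is an assumption about what the merged key looks like.

```python
# Hypothetical data: one cell barcode observed together with each of the
# four 8 bp sample-index oligos that make up a single sample index.
cell_barcode = "AAACCTGAGAAACCAT"  # 16 bp, made up
sample_indexes = ["GGTTTACT", "CTAAACGG", "TCGGCGTC", "AACCGTAA"]  # 8 bp each, made up

# Key built from sample index + cell barcode (the concatenation approach).
merged_keys = {si + cell_barcode for si in sample_indexes}

# Key built from the 16 bp cell barcode alone.
cell_only_keys = {cell_barcode for _ in sample_indexes}

print(len(merged_keys), len(cell_only_keys))  # 4 1
```

With the concatenated key, the same cell shows up under four different identifiers; with the cell barcode alone it stays a single identifier.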
10x Barcodes with v2 chemistry work like this: examples from
merge_barcodefiles_10x()
looks like there may be some confusion about which file does what?
I1 = sample barcode (SB) (8 bp)
python regex:
(@.*)\n(?P<SB>.*)\n+(.*)\n(.*)\n
R1 = cellular barcode (CB) (16 bp) + molecular barcode (MB) (UMI) (10 bp)
python regex:
(@.*)\n(?P<CB>.{16})(?P<MB>.{10})\n+(.*)\n(.*)\n
R2 = RNA reads (98 bp)
python regex:
(?P<name>@.*) .*\n(?P<seq>.*)\n+(.*)\n(?P<qual>.*)\n
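As a quick sanity check of the three patterns, here is a small sketch that applies them to made-up records. The read names and sequences are invented, and the R2 read is shortened for readability. One deliberate difference from the patterns as posted: the `+` separator line is escaped as `\+` here so that it is matched literally. That is my assumption about the intent, not something taken from the tool's code.

```python
import re

# Patterns from the post, with the "+" separator line escaped (assumption).
I1_RE = re.compile(r"(@.*)\n(?P<SB>.*)\n\+(.*)\n(.*)\n")
R1_RE = re.compile(r"(@.*)\n(?P<CB>.{16})(?P<MB>.{10})\n\+(.*)\n(.*)\n")
R2_RE = re.compile(r"(?P<name>@.*) .*\n(?P<seq>.*)\n\+(.*)\n(?P<qual>.*)\n")

# Hypothetical single records, one per file (I1, R1, R2).
i1_record = "@read1 1:N:0:0\nGGTTTACT\n+\nAAFFFJJJ\n"
r1_record = "@read1 1:N:0:0\nAAACCTGAGAAACCATTTGCAGACCA\n+\n" + "J" * 26 + "\n"
r2_record = "@read1 2:N:0:0\nACGTACGTAC\n+\nJJJJJJJJJJ\n"

i1 = I1_RE.match(i1_record)
r1 = R1_RE.match(r1_record)
r2 = R2_RE.match(r2_record)

parsed = {
    "SB": i1.group("SB"),      # 8 bp sample barcode from I1
    "CB": r1.group("CB"),      # first 16 bp of R1
    "MB": r1.group("MB"),      # next 10 bp of R1 (UMI)
    "name": r2.group("name"),  # read name without the trailing comment
    "seq": r2.group("seq"),
    "qual": r2.group("qual"),
}
print(parsed)
```

Only CB and MB come from R1, so a barcode window covering the first 16 bases of R1 would pick up the cellular barcode alone.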