shendurelab / fly-atac

Code relevant to sci-ATAC-seq of Drosophila embryogenesis.
MIT License

barcode issue #7

Closed: bakerwm closed this issue 4 years ago

bakerwm commented 4 years ago

@cusanovich, what is the structure of the barcode, and what format should the fastq files be in?

The sc_atac_10bpbarcode_split.py script failed to process the ATAC-seq fastq files.

Here is the error; the barcode was not found in the read names of the fastq file.

$  python sc_atac_fastq2bam.py -R1 SRR5837698.sra_1.fastq.gz -R2 SRR5837698.sra_2.fastq.gz -O results -P demo -G ~/data/genome/dm6/bowtie_index/dm6
... sc_atac_10bpbarcode_split.py", line 89, in <module>
    barcodes = line.strip().split()[1].split(":")[3]
IndexError: list index out of range

Here is the first read pair of SRR5837698:

$ zcat SRR5837698.sra_1.fastq.gz | head -n 4
@SRR5837698.sra.1 ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1 length=51
GGCTTNTATTATGACCGCAATGAAGTCCGATCGCAGATAATCCGCAAAGGA
+SRR5837698.sra.1 ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1 length=51
AAAAA#EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
(py27) wangming@yulab: ~/work/yu_2019/projects/public_data/2018_scATACseq_nature

$ zcat SRR5837698.sra_2.fastq.gz | head -n 4
@SRR5837698.sra.1 ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1 length=51
GGATACGATTTCTTTCTAAAAAGATGACCCATTTTGATTTAAGTAATTTTG
+SRR5837698.sra.1 ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1 length=51
AAAAAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEA

dagarfield commented 4 years ago

In the case of: "@SRR5837698.sra.1 ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG:1 length=51"

The barcode is simply

ATTCAGAACCGCTAAGAGNAAGATTATTAGATTCCG

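The traceback shows that the script reads the barcode from the fourth colon-separated field of the second whitespace-separated token of the read name, and in this header that token has only two such fields, hence the IndexError. So to get the fastq files into a format that works, you just need to edit the read headers. The end goal is for each read name in the final BAM file to be identical to its cell barcode.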
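
One possible way to do that editing is sketched below. It is a minimal sketch, not the authors' method: it assumes the SRA-style header shown above (`@<read id> <barcode>:<read number> length=<n>`) and rewrites it so the barcode sits in the fourth colon-separated field of the second token, as in Illumina CASAVA 1.8 headers (`<read>:<filtered>:<control>:<index>`), which is where the expression in the traceback looks for it. The function names are hypothetical, the sketch targets Python 3 (unlike the py27 environment above), and whether the rest of the pipeline accepts the rewritten files has not been checked.

```python
import gzip
import re

# Assumed SRA-style header: "@<read id> <barcode>:<read number> length=<n>"
HEADER = re.compile(r"^@(\S+) ([ACGTN]+):(\d+)")


def casava_style_header(header, read_number):
    """Move the barcode into the 4th colon field of the second token.

    Hypothetical helper: the 'N' and '0' fields are placeholders for the
    CASAVA 'filtered' and 'control' fields, which the SRA header lacks.
    """
    match = HEADER.match(header)
    if match is None:
        raise ValueError("unexpected header: " + header)
    read_id, barcode = match.group(1), match.group(2)
    return "@{} {}:N:0:{}".format(read_id, read_number, barcode)


def rewrite_fastq(in_path, out_path, read_number):
    """Rewrite every header of a gzipped FASTQ file (4 lines per record)."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            if i % 4 == 0:
                # Header line: relocate the barcode.
                line = casava_style_header(line, read_number)
            elif i % 4 == 2:
                # Separator line: the repeated read name is optional.
                line = "+"
            fout.write(line + "\n")
```

For example, `rewrite_fastq("SRR5837698.sra_1.fastq.gz", "SRR5837698_R1.fastq.gz", 1)` would write a copy of read 1 with rewritten headers (the output file name is only an illustration), and the same call with `read_number=2` would handle the second file.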