haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

[ERROR] barcode length is greater than 32 #125

Closed: rargelaguet closed this issue 1 year ago

rargelaguet commented 1 year ago

Hello, I am running chromap to process scATAC-seq data from 10x Chromium as follows:

GENOME_FASTA="(...)/refdata-gex-GRCh38-2020-A/fasta/genome.fa"
GENOME_INDEX="(...)/refdata-gex-GRCh38-2020-A/fasta/hg38.chromap.index"
FASTQ_FOLDER="(...)"
BARCODE_WHITELIST="(...)/cellranger-atac-2.1.0/lib/python/atac/barcodes/737K-cratac-v1.txt.gz"

# Run chromap
chromap -t 8 \
    --preset atac \
    --index $GENOME_INDEX \
    --ref $GENOME_FASTA \
    --read1 ${FASTQ_FOLDER}/adult_hADSCs_S1_L001_R1_001.fastq.gz \
    --read2 ${FASTQ_FOLDER}/adult_hADSCs_S1_L001_R3_001.fastq.gz \
    --output adult_hADSCs_chromap_fragments.tsv \
    --barcode ${FASTQ_FOLDER}/adult_hADSCs_S1_L001_R2_001.fastq.gz \
    --barcode-whitelist $BARCODE_WHITELIST

where the R1 and R3 files contain the 50 bp reads and the R2 file contains the cell barcodes (16 nt):

@SRR16889312.1 A00808:525:HVVM5DRXX:1:2101:1063:1110 length=16
TGGAATGCTACGTGCC
+SRR16889312.1 A00808:525:HVVM5DRXX:1:2101:1063:1110 length=16
FFFFFFFFFFFFFFFF

However, I get the following output and error:

Preset parameters for ATAC-seq/scATAC-seq are used.
Start to map reads.
Parameters: error threshold: 8, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 2000, MAPQ-threshold: 30, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 8
Analyze single-cell data.
Will try to remove adapters on 3'.
Will remove PCR duplicates after mapping.
Will remove PCR duplicates at cell level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Perform Tn5 shift.
Output mappings in BED/BEDPE format.
Reference file: /bi/group/reik/ricard/data/hg38_sequence/refdata-gex-GRCh38-2020-A/fasta/genome.fa
Index file: /bi/group/reik/ricard/data/hg38_sequence/refdata-gex-GRCh38-2020-A/fasta/hg38.chromap.index
1th read 1 file: /bi/group/reik/ricard/data/Guan2022_chemical_reprogramming/original/atac/test/chromap/adult_hADSCs_S1_L001_R1_001.fastq.gz
1th read 2 file: /bi/group/reik/ricard/data/Guan2022_chemical_reprogramming/original/atac/test/chromap/adult_hADSCs_S1_L001_R3_001.fastq.gz
1th cell barcode file: /bi/group/reik/ricard/data/Guan2022_chemical_reprogramming/original/atac/test/chromap/adult_hADSCs_S1_L001_R2_001.fastq.gz
Cell barcode whitelist file: /bi/group/reik/ricard/data/software/cellranger-atac-2.1.0/lib/python/atac/barcodes/737K-cratac-v1_revcomp.txt.gz
Output file: adult_hADSCs_chromap_fragments.tsv
Loaded all sequences successfully in 7.81s, number of sequences: 194, number of bases: 3099750718.
Kmer size: 17, window size: 7.
Lookup table size: 393121381, occurrence table size: 447341405.
Loaded index successfully in 24.71s.
Loaded sequence batch successfully in 0.00s, number of sequences: 1000, number of bases: 16000.
ERROR: barcode length is greater than 32!

Any idea why it does not detect the right barcode length (16 nt)?

Thanks, Ricard

haowenz commented 1 year ago

Which version were you using? Can you run `chromap -v` to get the version and let us know?

rargelaguet commented 1 year ago

Sorry, I forgot to include this information.

I installed it from Bioconda:

> chromap --version
0.2.3-r407

haowenz commented 1 year ago

Cell barcode whitelist file: /bi/group/reik/ricard/data/software/cellranger-atac-2.1.0/lib/python/atac/barcodes/737K-cratac-v1_revcomp.txt.gz

It looks like you were using a gzipped file for the whitelist, which is not supported yet. See #51. You can try --barcode-whitelist <(zcat whitelist.tsv.gz), or decompress it and use the plain txt file.
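
The second option (decompressing the whitelist first) could be sketched as below. This is a minimal illustration, not part of chromap: the helper name `decompress_whitelist` and the file names in the usage comment are hypothetical. The length check is added on the assumption that a whitelist line longer than 32 characters is what would trigger the error above.

```shell
# Hypothetical helper (not part of chromap): decompress a gzipped barcode
# whitelist to a plain-text file and print the length of its longest line,
# so an unexpected barcode length shows up before mapping starts.
decompress_whitelist() {
    gz_path="$1"    # gzipped whitelist, one barcode per line
    txt_path="$2"   # destination for the plain-text whitelist

    # gzip -dc writes the decompressed content to stdout and leaves the .gz intact.
    gzip -dc "$gz_path" > "$txt_path" || return 1

    # Report the longest barcode; for the 10x ATAC whitelist described
    # in this thread, 16 would be expected.
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$txt_path"
}

# Example usage (file names are placeholders):
#   decompress_whitelist whitelist.txt.gz whitelist.txt
#   chromap ... --barcode-whitelist whitelist.txt
```

The process-substitution form suggested above avoids the intermediate file; writing a plain-text copy instead makes the whitelist easy to inspect and reuse across runs.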

rargelaguet commented 1 year ago

Seems to be working now, thank you!