immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0
296 stars 65 forks

repLoad cannot read 10X single-cell TCR-seq data #355

Open dzhaobio opened 1 year ago

dzhaobio commented 1 year ago

🐛 Bug

Chemistry: Single Cell V(D)J R2-only
V(D)J Reference: vdj_GRCm38_alts_ensembl-4.0.0
Pipeline Version: cellranger-5.0.1

I installed the package as follows, per https://github.com/immunomind/immunarch/issues/342, and then loaded the Cell Ranger output directory:

install.packages(c("devtools", "pkgload"))
devtools::install_github("immunomind/immunarch", ref="dev")
devtools::reload(pkgload::inst("immunarch"))

immdata <- repLoad("/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs/")

== Step 1/3: loading repertoire files... ==

Processing "/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs/" ...
  -- [1/14] Parsing "/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs//airr_rearrangement.tsv" -- airr
  -- [2/14] Parsing "/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs//all_contig_annotations.csv" -- 10x (filt.contigs)
  [!] Removed 7394 clonotypes with no nucleotide and amino acid CDR3 sequence.
  -- [3/14] Parsing "/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs//all_contig_annotations.json" -- unsupported format, skipping
  -- [4/14] Parsing "/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs//all_contig.bam.bai" -- Error in stri_trim_both(string) : invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
In addition: Warning message:
In readLines(f, 1) : line 1 appears to contain an embedded nul

To Reproduce

Expected behavior

Additional context

brucewayne521 commented 1 year ago

I ran into the same issue. Any idea how to fix it?

vadimnazarov commented 1 year ago

Hi everyone, could you provide a couple of example datasets to test it? Thank you!

ImNotaGit commented 6 months ago

It seems that this is related to #92 and #311. The workaround there was to move the .bai file elsewhere, or to create a new directory containing only the necessary data files, without the .bai file.
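For anyone who wants to script that workaround, here is a minimal sketch in R. The helper name stage_repertoire_files is hypothetical (it is not part of immunarch), and the default choice of filtered_contig_annotations.csv as the only file to keep is my assumption about what you want repLoad to parse; adjust keep to match your own outs/ directory.

```r
# Hypothetical helper (not part of immunarch): copy only the files that
# repLoad should parse into a clean directory, leaving .bai etc. behind.
stage_repertoire_files <- function(src_dir,
                                   dest_dir = tempfile("vdj_clean_"),
                                   keep = c("filtered_contig_annotations.csv")) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

  # Only consider the requested files that actually exist in src_dir.
  candidates <- file.path(src_dir, keep)
  present <- candidates[file.exists(candidates)]
  if (length(present) == 0) {
    stop("None of the requested files were found in ", src_dir)
  }

  file.copy(present, dest_dir)
  dest_dir
}

# Usage (commented out because it needs immunarch and the real data):
# clean_dir <- stage_repertoire_files("/gpfs/raw_cellranger/TCR/P1_TCR_VMT/outs/")
# immdata <- immunarch::repLoad(clean_dir)
```

The original outs/ directory is left untouched, so nothing that Cell Ranger or other tools expect there gets moved.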

However, on a quick check of the source code of repLoad, I found these lines:

exclude_extensions <- c(
  "so", "exe", "bam", "fasta", "fai", "fastq", "bed", "rds", "report", "vdjca"
)

I wonder whether a better fix could be as simple as adding "bai" to this list.
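To illustrate why that should be enough, here is a small standalone sketch of an extension filter with "bai" appended. I have not traced how repLoad actually applies exclude_extensions, so drop_excluded below is a hypothetical stand-in rather than immunarch's code; it only shows that matching on the final extension would catch all_contig.bam.bai.

```r
# The list from repLoad, with the proposed addition at the end.
exclude_extensions <- c(
  "so", "exe", "bam", "fasta", "fai", "fastq", "bed", "rds", "report", "vdjca",
  "bai"  # proposed addition
)

# Hypothetical stand-in for the filtering step (not immunarch's actual code):
# drop every path whose final extension is in the exclusion list.
drop_excluded <- function(paths, excluded = exclude_extensions) {
  ext <- tolower(tools::file_ext(paths))
  paths[!(ext %in% excluded)]
}

files <- c(
  "airr_rearrangement.tsv",
  "all_contig.bam",
  "all_contig.bam.bai",
  "filtered_contig_annotations.csv"
)

# Keeps only the .tsv and .csv files; both .bam and .bam.bai are dropped.
drop_excluded(files)
```

Note that tools::file_ext looks only at the last extension, which is why all_contig.bam.bai is not already covered by the existing "bam" entry.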