mehdiborji / nanoranger

simplified cellranger for long-read data
MIT License
15 stars 3 forks source link

Fusion Calling Error #5

Open eseffar opened 4 months ago

eseffar commented 4 months ago

Hello, Thank you for this tool. I have 5' 10x Library sequenced with Nanopore Sequencing. I previously used JAFFAL to recover known fusion from Single-Cell which works quite well and I wanted to use your fusion detection pipeline using a fasta file to see how it performs with it. However, I encounter this error message on my own data:

alignment to genome and generation of BC-UMI-Transcript tagged BAM 

cores = 20
ref = /home/user/nanoranger/FUSION_SEQUENCE.fa
infile= FUSION_TEST/fusion_deconcat.fastq.gz
outdir = FUSION_TEST
sample = fusion
[M::mm_idx_gen::0.001*1.50] collected minimizers
[M::mm_idx_gen::0.001*5.99] sorted minimizers
[M::main::0.001*5.96] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*5.82] mid_occ = 15
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.002*5.70] distinct minimizers: 626 (98.72% are singletons); average occurrences: 1.032; average spacing: 2.913; total length: 1882
[M::worker_pipeline::0.734*16.79] mapped 103327 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -aY --eqx -x splice -t 20 --secondary=no --sam-hit-only /home/user/nanoranger/FUSION_SEQUENCE.fa FUSION_TEST/fusion_deconcat.fastq.gz
[M::main] Real time: 0.738 sec; CPU: 12.330 sec; Peak RSS: 0.053 GB
[bam_sort_core] merging from 0 files and 20 in-memory blocks...
number of genome aligned reads =  4693
10000 barcode candidates processed
20000 barcode candidates processed
30000 barcode candidates processed
40000 barcode candidates processed
50000 barcode candidates processed
60000 barcode candidates processed
70000 barcode candidates processed
80000 barcode candidates processed
number of short UMI reads =  250
20000 Read-BC-UMI-Transcript tuples saved
40000 Read-BC-UMI-Transcript tuples saved
60000 Read-BC-UMI-Transcript tuples saved
rm: cannot remove 'FUSION_TEST/fusion_matching_*': No such file or directory
`

Suprisingly I encounter the same error with the test data

 alignment to genome and generation of BC-UMI-Transcript tagged BAM 

cores = 8
ref = /home/user/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa
infile= K562_Kasumi1/fusion_deconcat.fastq.gz
outdir = K562_Kasumi1
sample = fusion
[M::mm_idx_gen::0.001*1.89] collected minimizers
[M::mm_idx_gen::0.001*2.31] sorted minimizers
[M::main::0.001*2.30] loaded/built the index for 7 target sequence(s)
[M::mm_mapopt_update::0.002*2.22] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 7
[M::mm_idx_stat::0.002*2.17] distinct minimizers: 2164 (96.63% are singletons); average occurrences: 1.035; average spacing: 2.973; total length: 6656
[M::worker_pipeline::0.050*6.04] mapped 3152 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -aY --eqx -x splice -t 8 --secondary=no --sam-hit-only /home/user/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa K562_Kasumi1/fusion_deconcat.fastq.gz
[M::main] Real time: 0.050 sec; CPU: 0.303 sec; Peak RSS: 0.010 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
number of genome aligned reads =  2883
number of short UMI reads =  4
rm: cannot remove 'K562_Kasumi1/fusion_matching_*': No such file or directory

Here is my working environment

The files present in the output directory for my data so far are : fusion_barcode_scores.csv fusion_barcode_scores.pdf fusion_bcumi_dedup.csv fusion_BCUMI.fasta.gz fusion_deconcat.fastq.gz fusion_genome_tagged.bam fusion_genome_tagged.bam.bai fusion_knee.pdf fusion_matching.sam fusion_trns_ct.csv

I was looking to have an output file with the reads + barcodes + presence of the fusion, but I'm not sure I've found this in any of these files. Do you have a wiki with the output files created and their content description? I guess I must use the fusion_gene.py in the downstream folder in scripts, but I am unsure of the arguments I need to fill in to use it. Also related to the script you provide, what is the script performing the extraction of the 10x barcodes? I saw that there are two bash scripts barcode_align.sh and barcode_ref.sh so I imagine those two which are called right ?

Thank you for your help, Evan

mehdiborji commented 4 months ago

Hi Evan,

Thank you for trying out our tool.

I can't easily see why you are getting the error for the rm command; however, this command only removes the intermediate files from barcode matching, and failure to remove those shouldn't interfere with the rest of the process as this is done in the last step of the pipeline.

It seems you have all the relevant files including the barcoded BAM file fusion_genome_tagged.bam. You have understood correctly that to process this BAM file you should run the fusion_gene.py script. You simply need to provide the path to this BAM file and an output file. I added a line and some explanation in the README. What this script does is simply to tabulate the information in the BAM file in a CSV for further processing, which is mostly manual and not very meaningful without having the gene expression data.

For each barcode, there will be ideally only one transcript and multiple PCR copies of that transcript (some of which might align to isoforms of the gene if you include them in reference FASTA such as RUNX1-205 and RUNX1-216 in our example).

The scripts for barcode matching are the ones you identified. However to extract the candidates from the reads utils.decon_5p10XGEX is used for 5' data and other decon functions for other modalities.

Let me know if this explanation, along with the README information, answers your questions. I'm happy to discuss this further if you would like to have a chat. I can potentially give your data a try if you would like to share it.

Best, Mehdi

eseffar commented 4 months ago

Hello, Thank you very much for your quick reply and all the explanations. First of all, it works! This is amazing; I was able to recover much more fusion than other methods, and the fact that we can target a specific fusion recovered many more events, which is great!

Concerning the rm command, I modified the line 162 in the pipeline.py script to remove the character _ after matching and it works now. I assumed it failed because the output file was named fusion_matching.sam and not fusion_matching_samplename.sam. Also, I was wondering if it's possible to optimize even more the barcode matching by giving in input the 10x barcodes of the cells that I sequenced also in Illumina? I saw there is a barcodes argument for this script. To give context, I have paired Illumina / nanopore for each sample, I'm able to detect almost 3000 cells harboring the fusion in one sample in the nanopore sequenced data, but 1/3 match the barcodes in the Seurat object created from the cells sequenced in Illumina. This is good but can be better. Let me know if it's clear enough or I can provide more details.

Thank you again for your help, this is a great tool!

Evan

mehdiborji commented 4 months ago

I'm glad it worked for you!

Very interesting that you see so many cells with fusions! I am curious if you are able to see this only/mostly in the expected cells or if many other cells, which you don't expect to bear fusions, seem to have fusion transcripts too.

The barcodes/reads you see reported as having fusion but not in your Seurat object may be some technical artifacts or weird combinations of errors that end up being reported as matching to a barcode in the whitelist.

This portion of the reads should be small in terms of the fraction of all the reads compared to the reads that belong to the real cells. For example, 2000 barcodes each with one read vs. 1000 barcodes each with an average of 30 reads, so 2000 out of 32000 reads not accounted for.

It's important to ensure that you filter potentially artifactual reads due to errors in sequencing/PCR. Also, depending on the conditions, these may be real molecules in the empty droplets.

Regarding obtaining more reads matched to the barcodes, you can certainly include only the list of barcodes you have identified as cells. To do this, you can provide the filtered_feature_bc_matrix/barcodes.tsv.gz from cellranger outputs with the flag --b as an additional input argument.

This approach will increase the specificity on those cells and speed up the barcode alignment process due to a smaller reference size. However, there's a chance of losing some precision by forcing 'some' reads to match your whitelist erroneously. As mentioned above, these reads may be technical artifacts or molecules in empty droplets. A middle ground could be selecting all the barcodes associated with droplets, which usually range from 30k to 100k (including all barcodes with 5-10 UMIs and higher), compared to the entire whitelist of 737k.

I have an idea of why the rm command doesn't seem to fail for me, and it may be an OS and/or Python-specific issue. For now, I have incorporated your change. Thank you for catching this issue!

Let me know how if you had any other questions.

Mehdi

eseffar commented 4 months ago

Hello,

Thank you for all the explanation again that's very useful to know! I am studying a fusion-driven cancer type and the cells that are labeled as having the fusion are expected. Our gene expression signature is concordant and has been validated by the nanopore data, specifically by JAFFAL and Fusion Seeker first findings and now by the nanoranger approach.

However, I have noticed that 8~9% of cells in other clusters not labeled as cancer cells are also harboring the fusion found by nanoranger. I am currently investigating this and trying to determine the reason for this mismatched occurrence. It could be due to many reasons as you suggested, or the use of raw barcodes instead of filtered barcodes in my case. There could also be a few mislabeled cells but I would not expect to have something as this level. I will investigate those specific cells. I will also use the exact fasta sequence of the fusion present in each sample found by FusionSeeker to determine if it can improve 1. the detection 2. lower the false positive as it will be a sample specific fusion sequence. Finally, I will try to filter out barcodes with a defined number of reads harboring the fusion to filter out false positives, as you mentioned.

Do you think it's possible to have a quality or confidence score for the barcodes matching when using the whitelist of barcodes as input ? I saw there is file named fusion_barcode_scores.csv to see the overall score of barcodes but I've not found a file with the score per barcode.

In definitive, 90~91% of cells harboring the fusion going to the right cluster is very good, it means to me that our signature to find cancer cells is correct, if we can take this score up that would be great. I have also match normal samples that I will use as a negative control to see if the tool finds any cells harboring the fusion, it will helps me to define a threshold % of error.

Thank you, Evan

mehdiborji commented 4 months ago

Hi Evan,

There's the alignment score that each candidate read gets. With 10x barcodes the highest is a perfect match (score 16 = 16nt matching). What you see in scores.csv file is a listing of all the scores for the reads that aligned to one unique barcode. Mostly they will be perfect matchers for Q20 kits and using entire 737k barcodes. With lower quality reads and/or usage of fewer barcodes in the whitelist the proportion of the reads mapping with some error will go up. I take reads with score of at least 14 meaning maximum one mismatch or indel. You can change this to include only perfect matches by modifying line if AS>=14 and read.flag==0: in utils.process_matching_5p10X function.

Some questions for you to think about the reason of observing fusion in non-malignant cells:

How long is fusion transcript? assuming its entirety is captured in 10x cDNA (highly unlikely for transcripts longer than 2-3kb). How how highly expressed the fusion transcript is? How good is the quality of library in terms of ambient RNA? How many reads and also unique UMIs (distinguishable by eye not collapsing identical UMIs, as errors in ONT doesn't allow very accurate quantification of UMIs especially in highly duplicated regime if you performed targeted PCR) barcodes in malignant cells have vs non-malignant ones.

Depending on the type of artifact and expression, you might be getting one UMI per non-malignant cell and many UMIs for malignant cells. Or you might be getting one real UMI with lots of supporting reads in the malignant cells and one fake UMI with 1 or 2 reads in the non-malignant.

Let me know if you have other questions.

Mehdi