Question about UMI and and ambiguous J annotation

ljouneau commented 2 years ago

Thank you for developing this tool !

I work on a bulk RNA-Seq Illumina sequencing result (BCR sequences have been caught using a C primer and a 5‘RACE protocol). We have introduced UMI in these sequences and these UMI are identified and associated to each sequence using a bioinformatics treatment, prior to Trust4 submission. Therefore, I have a fastq file containing my BCR sequences and a fastq ? file containing my UMI sequences.

1°) If I try to run Trust4 using --noExtraction option, I get following error :

$ ./run-trust4 -t 4 -f Swanson/Swanson_VDJ.fasta --ref Swanson/Swanson_VDJ.fasta \

    -u NucNanoFish/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.fastq  \
    --UMI **NucNanoFish/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.umi.fastq** -o NucNanoFish_UMI/NucNanoFish2 \
    --outputReadAssignment --noExtraction
[Wed Sep 14 11:55:23 2022] TRUST4 begins. [Wed Sep 14 11:55:23 2022] SYSTEM CALL: /work/ljouneau/TRUST4/trust4 -t 4 -f Swanson/Swanson_VDJ.fasta -o NucNanoFish_UMI/NucNanoFish2 -u NucNanoFish/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.fastq --UMI NucNanoFish_UMI/NucNanoFish2_toassemble_umi.fa terminate called after throwing an instance of 'std::logic_error' what(): basic_string::_M_construct null not valid system /work/ljouneau/TRUST4/trust4 -t 4 -f Swanson/Swanson_VDJ.fasta -o NucNanoFish_UMI/NucNanoFish2 -u NucNanoFish/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.fastq --UMI NucNanoFish_UMI/NucNanoFish2_toassemble_umi.fa failed: 6 at ./run-trust4 line 49.

Please notice the change in the name of the UMI file compared to the one I gave on the command line (in bold). Is it a bug or did I miss something?

2°) Are UMI taken into account without specifying anything for the --barcode option?

3°) I have a question about the UMI and the abundancy counts. If I have two sequences with the same UMI but a distinct CDR3 nucleotide sequence (possibly with the same V/D/J/C calling) are they considered as two different sequences or are they aggregated and counted as 1 sequence?

4°) We are working on a species in which J segments are very similar. Therefore it is often impossible to annotate J segments unambiguously. In this case, how the J of these reads are annotated and to which assemble counts is it assigned?

Thank you in advance for your answers. Best regards

mourisl commented 2 years ago

1) Thank you for letting me know. This is indeed a bug. I will fix it today. 2) The UMI will be taken into account in the abundance estimation. 3) They will be considered as two different sequences. If your UMI is unique, I think you might try the option "--barcode umifile --barcode-level molecule", (--barcode also has the bug when using --noExtraction option. I will let you when I fix it.) 4) TRUST4 still tries to annotate the J genes with the least edit distance, so even the J segments are very similar, it might be fine.

Hope these help. Thank you!

mourisl commented 2 years ago

Just want to make sure "034_RNA_Bl_d00_NucNvac2_1_assemble-pass.fastq" is already candidate BCR reads, not the original RNA-seq?

ljouneau commented 2 years ago

Exactly. I have a presto (https://presto.readthedocs.io) pre-treatement.

ljouneau commented 2 years ago

I am impressed by your responsivness ! Thank you so much

ljouneau commented 2 years ago

BTW, I also noticed that TRUST4 failed to extract the sequences if the reference sequences are in lowercase. The step after fails since there is no sequence to analyzed.

Indeed, some of IMGT reference are specified in lowercase (see by instance https://www.imgt.org/download/V-QUEST/IMGT_V-QUEST_reference_directory/Oncorhynchus_mykiss/IG/IGHJ.fasta)

mourisl commented 2 years ago

Thank you for noticing this. The script to download IMGT sequence "BuildImgtAnnot.pl" converts those lower cases to upper cases. But I can add the conversion within TRUST4 as well.

I'm currently adding the options to the trust-simplerep.pl and trust-airr.pl scripts in case you want the CDR3 summary files by just considering the retained CDR3 in the barcode report file. The default non-barcode output could contain multiple CDR3s from the same UMI.

ljouneau commented 2 years ago

> The script to download IMGT sequence "BuildImgtAnnot.pl" converts those lower cases to upper cases. Ah, sorry, I missed this point.

> But I can add the conversion within TRUST4 as well. I think this is a good idea, since I spent some time on this issue : why human example works perflectly and it does not work on my data ? I thought at the beginning that something was wrong with my fastq, but I realised that the problem was in the reference when I tried to run TRUST4 on my fastq using human reference (biologically it was a nonsense, but it turned out to be technically meaningful)

I am interested in your new version of perl script above. Please letme know when they are available

mourisl commented 2 years ago

I have pushed some updates to the "dev" branch. Could you please check it out, recompile TRUST4, and see whether the results make sense? Thank you.

ljouneau commented 2 years ago

I have installed the dev version and recompile. --noExtraction worked fine now. Thank you.

Today, I have tested two options (the one I used before and the one yoou recommended above) : a°) --UMI NucNanoFish_example/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.umi.fastq.gz b°) --barcode NucNanoFish_example/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.umi.fastq.gz --barcode-level molecule

The abundancy counts are really different (see file below). NucNanoFish_report.tar.gz

By instance, for CDR3 GTTTATTATTGTGCAGTAATGGGGGGGGATGCTTTTGACTACTGG : a°) count=298 b°) count=1235

What are the differences of goals between these two commands regarding the guarantee of uniqueness of physical BCR?

The biologist I am working with wanted some clarification about the J annotation :

"We understand that TRUST4 annotate J , choosing the J genes with the least edit distance. However, in a number of cases, we have 2 genes with exactly the same distance. How is the choice made in this case ? is the same gene among the 2 equally distant genes selected consistently if it happens several times ? Are the two genes proposed ? "

Thank you in advance

mourisl commented 2 years ago

1: In a), the UMI is only used for quantification after the CDR3s are assembled. In b), TRUST4 assembled each UMI independently and then select the consensus for each UMI. The CDR3's are summarized based on the result. So theoretically, there should be more count in method a) and the number shouldn't differ too much. I'll look into this issue.

2: TRUST4 calculate the total J gene usage, and if a tie happens, it prioritizes the J gene used more often in the sample. The report file only contains one J genes, but the _annot.fa file contains multiple J gene assignments with alignment similarity in the parenthesis. This could be useful in your analysis.

ljouneau commented 2 years ago

Dr Li,

If you are interested in it, I have developed a perl script restoring ambiguity of annotations for V D J (and I can extend it to C). The rule is the following : Identify the best identity score, and annotated with several annotations if this best identity score is reached for several reference sequences. Just in case …

Best regards Luc

De : Li Song @. Envoyé : mercredi 14 septembre 2022 19:43 À : liulab-dfci/TRUST4 @.> Cc : Luc Jouneau @.>; State change @.> Objet : Re: [liulab-dfci/TRUST4] Question about UMI and and ambiguous J annotation (Issue #154)

I have pushed some updates to the "dev" branch. Could you please check it out, recompile TRUST4, and see whether the results make sense? Thank you.

— Reply to this email directly, view it on GitHubhttps://github.com/liulab-dfci/TRUST4/issues/154#issuecomment-1247101711, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFUI5IQIYBIFK4N6UYXB5LTV6IFC7ANCNFSM6AAAAAAQMNZMXI. You are receiving this because you modified the open/close state.Message ID: @.***>

mourisl commented 2 years ago

Thank you for the script. You can put a link here so other users can use it. I'm also thinking about adding the function to the report script.

mourisl commented 2 years ago

I looked into the inflation of the abundance estimation with method (a). I think it is because reads from other TCR's that partially overlap the CDR3 could be assigned to many contigs, and TRUST4 greedily select one of the targets. So the bulk setting (method (a)) will inflate one of the CDR3s. Method (b) (--barcodeLevel molecule) is much more robust.

ljouneau commented 2 years ago

It seems that inflation of counts occured in b°) method and not in a°).

Here is the link to the perl script restoring ambiguities of matches in *_report.tsv file. get_ambiguity.zip

Best regards

mourisl commented 2 years ago

It's strange that method (b) reports more reads than method (a). In method (b), each barcode/UMI should only be counted exactly once, so the total abundance (column 1) should be less than (a) . Could you please show me the full command you used? I will look into this again. Thank you.

ljouneau commented 2 years ago

a°) run-trust4 -t 4 -f $TRUST4_HOME/trout_reference/Swanson_VDJ.fasta --ref $TRUST4_HOME/trout_reference/Swanson_VDJ.fasta \ -u NucNanoFish_example/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.fastq.gz \ --UMI NucNanoFish_example/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.umi.fastq.gz \ -o $output_dir/NucNanoFish \ --outputReadAssignment --noExtraction

b°) run-trust4 -t 4 -f $TRUST4_HOME/trout_reference/Swanson_VDJ.fasta --ref $TRUST4_HOME/trout_reference/Swanson_VDJ.fasta \ -u NucNanoFish_example/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.fastq.gz \ --barcode NucNanoFish_example/F034_RNA_Bl_d00_NucNvac2_1_assemble-pass.umi.fastq.gz --barcode-level molecule \ -o $output_dir/NucNanoFish \ --outputReadAssignment --noExtraction

mourisl commented 2 years ago

I just pushed some updates to the "dev" branch. Could you please give it a try? You can run TRUST4 with option "--stage 3" to only conduct the post assembly steps to save time. I think the total abundance from method (b) should be roughly equal to the number of rows in the trust_barcode_report.tsv file in the new run.

ljouneau commented 2 years ago

I got the updates and rerun method b°) (--barcode) from scratch. I compared results of method a°) and b°) and I let you examine how it looks like : compare_TRUST4_UMI_barcode.txt It maybe difficult to compare since assembles are not exactly the same : one assemble of a°) method maybe dispatched in several assembles with method b°) (and vice versa).

BTW I have now two warnings when I run trust_simplerep.pl : $ perl $TRUST4_HOME/trust-simplerep.pl $output_dir/NucNanoFish_cdr3.out > $output_dir/NucNanoFish_report.out readline() on closed filehandle FP1 at /save/ljouneau/pierre/5_RACE_analysis/TRUST4/trust-simplerep.pl line 333. readline() on closed filehandle FP1 at /save/ljouneau/pierre/5_RACE_analysis/TRUST4/trust-simplerep.pl line 381.

Maybe I had it also after first dev updates but I did not notice it.

Best regards

mourisl commented 2 years ago

Thank you for summarizing the data. method a) has total 75644 abundances (sum of column 5), and method b) has 26981. Based on method (b), TRUST4 reconstructed receptor information for 26981 molecules. I think this ratio looks right to me.

Could you please share the NucNanoFish_cdr3.out file with me? I could not trigger that warning message on my test data.

ljouneau commented 2 years ago

Could you please share the NucNanoFish_cdr3.out file with me? I could not trigger that warning message on my test data.

Hi,

cdr3.out file can be downloaded here :

https://filesender.renater.fr/?s=download&token=9bc05a89-9684-4619-9cf9-cd8be091b9d7

Best regards

Luc

ljouneau commented 2 years ago

Sorry to bother you again, but I have an assemble for which the sequence is missing in the *_annot.fa file.

Here is the header orphan, flanked by previous and next sequence : annot_missing_sequences.zip

Nothing seems special (V,D,J seems to be correctly annotated).

mourisl commented 2 years ago

Thank you for sharing the files. It is strange that I did not receive the perl warning message on our server. I will try it on another system later.

For the missing assemble ids, the report file will coalesce the sequences sharing the same V, J, C genes and CDR3, and only select one of the assembled contig as the representative. Do you mean the CDR3 sequence is also missing in the final report at all? Thank you.

mourisl commented 2 years ago

I got the updates and rerun method b°) (--barcode) from scratch. I compared results of method a°) and b°) and I let you examine how it looks like : compare_TRUST4_UMI_barcode.txt It maybe difficult to compare since assembles are not exactly the same : one assemble of a°) method maybe dispatched in several assembles with method b°) (and vice versa).

BTW I have now two warnings when I run trust_simplerep.pl : $ perl $TRUST4_HOME/trust-simplerep.pl $output_dir/NucNanoFish_cdr3.out > $output_dir/NucNanoFish_report.out readline() on closed filehandle FP1 at /save/ljouneau/pierre/5_RACE_analysis/TRUST4/trust-simplerep.pl line 333. readline() on closed filehandle FP1 at /save/ljouneau/pierre/5_RACE_analysis/TRUST4/trust-simplerep.pl line 381.

Maybe I had it also after first dev updates but I did not notice it.

Best regards

I think the warning happens when the file "$output_dir/NucNanoFish_cdr3.out " does not exist. Could you please check again to make sure $output_dir/NucNanoFish_cdr3.out is on the right path in this command? Thank you.

ljouneau commented 2 years ago

I think the warning happens when the file "$output_dir/NucNanoFish_cdr3.out " does not exist. Could you please check again to make sure $output_dir/NucNanoFish_cdr3.out is on the right path in this command? Thank you.

Yes you are right. Sorry for the mistake

ljouneau commented 2 years ago

For the missing assemble ids, the report file will coalesce the sequences sharing the same V, J, C genes and CDR3, and only select one of the assembled contig as the representative. Do you mean the CDR3 sequence is also missing in the final report at all? Thank you.

CDR3 sequence is OK in the final report for this assembly. The sequence is missing only in the _annot.fa fasta file, but I got it in _final.out file

mourisl commented 2 years ago

CDR3 sequence is OK in the final report for this assembly. The sequence is missing only in the _annot.fa fasta file, but I got it in _final.out file

I see! This could be a serious bug. Could you please share the assemble11973, assemble11974, assemble11975 sequences and theirs count weight rows (totally 6 rows with the header each) from the _final.out file? It's also strange that the assemble id for assemble11975 suddenly bumped up to assemble991975.

Thank you.

ljouneau commented 2 years ago

I think this is a false alarm. I remembered that I manually added some sequences tagged assemble99xxxx to see how they were annotated. Probably I introduced some mistake in the fastq file during this step. I got back to the original data and I do not miss any sequences in the _annot.fa file now

Sorry for the trouble. Best regards

mourisl commented 2 years ago

Thank you! The annotator can annotate fasta files without the four weight rows. You can run annotator with the option "--fasta" for that.

ljouneau commented 1 year ago

Thank you for your answer,

1: In a), the UMI is only used for quantification after the CDR3s are assembled. In b), TRUST4 assembled each UMI independently and then select the consensus for each UMI. The CDR3's are summarized based on the result. So theoretically, there should be more count in method a) and the number shouldn't differ too much. I'll look into this issue.

If you want, I can send to you the data I used to test TRUST4.

Best regards

Luc

mourisl commented 1 year ago

Thank you. I have checked other data. The large difference is mainly due to the ambiguous alignments from the single-end data. The "--barcode molecule" option will give you a much more accurate abundance estimation.

liulab-dfci / TRUST4

Question about UMI and and ambiguous J annotation #154