how the VDJs are assigned to contigs

saramoein372 commented 1 year ago

Hello,

I have a question about how the VDJs are assigned to the contigs. I have a trust4_cdr3.out file, and I can see that, for example, the malignant cells in my data have multiple VDJs. How is this possible?

I think knowing how the VDJs are assigned will help me figure out the reason for this issue in my data.

Thanks, Sara

mourisl commented 1 year ago

We align the reference sequences of IMGT to the assembled contigs to identify the VDJC genes and CDR3 region. trust4_cdr3.out contains many partial CDR3s, which are likely to be false positive or partially expressed V, J genes. The trust_barcode_report.tsv file is cleaner.

saramoein372 commented 1 year ago

Thanks Li.

To get the VDJs, the full sequences are used. Correct?

Thanks, Sara

saramoein372 commented 1 year ago

And what could be the reason that cancer cells have different VDJs? We expect the cancer cells to have the same VDJs.

mourisl commented 1 year ago

Not the full VDJ (5' of V to 3' of J) sequence, but the full CDR3 sequence should be used.

It could be that the other VDJs are from non-productive recombinations. Is your data from lymphoma or myeloma?

saramoein372 commented 1 year ago

Just to make sure: are the VDJs assigned based on the full CDR3? Is column 9 in trust4_cdr3.out the full CDR3? If not, which column in which file shows the full CDR3?

mourisl commented 1 year ago

The VDJ annotation is based on the assembled contigs from the trust_final.out file. The contigs could contain partial VDJ sequences, such as from the middle of the V gene to the first 10 bp of the J gene. If the contig only covers part of the CDR3, you will still get a V or J gene assignment and the partial CDR3 sequence. The partial CDR3 will be marked in the CDR3_score column in the trust_cdr3.out file (the 10th column, the column after the CDR3 sequence), where a score of 0 means partial CDR3 and a non-zero score means full CDR3.
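The score rule above can be applied with a few lines of Python. This is only a minimal sketch: it assumes the file is tab-delimited, it uses the column names that are confirmed later in this thread, and the helper names (`read_cdr3_rows`, `has_full_cdr3`) and the two example rows are made up for illustration.

```python
import csv
from io import StringIO

# Column order as confirmed later in this thread for the *_cdr3.out file.
CDR3_COLUMNS = [
    "consensus_id", "index_within_consensus", "V_gene", "D_gene", "J_gene",
    "C_gene", "CDR1", "CDR2", "CDR3", "CDR3_score", "read_fragment_count",
    "CDR3_germline_similarity", "full_length_assembly",
]

def read_cdr3_rows(handle):
    """Yield one dict per data line (assumes tab-delimited input)."""
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and any comment-style lines
        yield dict(zip(CDR3_COLUMNS, fields))

def has_full_cdr3(row):
    """Score 0 marks a partial CDR3; any non-zero score marks a full one."""
    return float(row["CDR3_score"]) != 0.0

# Made-up rows, for illustration only (not real TRUST4 output).
example = StringIO(
    "contigA\t0\tIGKV2-28*01\t*\tIGKJ5*01\tIGKC\t*\t*\tTGCATACAAGGT\t1.00\t25.0\t95.0\t0\n"
    "contigB\t0\tIGKV3-11*01\t*\t*\t*\t*\t*\tTGTCAGCAG\t0.00\t3.0\t100.0\t0\n"
)
full_rows = [row for row in read_cdr3_rows(example) if has_full_cdr3(row)]
print([row["consensus_id"] for row in full_rows])
```

To use it on real output, replace the `StringIO` object with `open(path, newline="")` on your own cdr3.out file.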

saramoein372 commented 1 year ago

Thanks Li. So, where can I get the VDJ assignment from the assembled contigs in trust_final.out file?

mourisl commented 1 year ago

The VDJ annotations are in the trust_annot.fa file. The sequence id is the link to the first column of the trust_cdr3.out file.

saramoein372 commented 1 year ago

Thanks. I think I already asked this question in my previous emails, but I cannot find the V-D-J genes in the trust_annot.fa file.

Can you help me to understand this?

mourisl commented 1 year ago

It should be in the sequence header fields in the fasta files. Some sequences could not be annotated with V, D, J genes due to assembly artifacts, but most of them should have some annotation. Could you please share a few lines in your trust_annot.fa file?

saramoein372 commented 1 year ago

Sure:

CCGTACTTCAGGTAAA_0 134 11.88 * IGKC(523):(0-133):(2-135):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

ACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGC

TGACGGCTCGATAGAA_1 255 47.62 * IGKC(523):(19-254):(0-235):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

ACCAAGCTGGAGATCAGACGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTAC

ACGGCCATCGGAGGTA_2 326 534.63 * IGKC(523):(85-325):(0-240):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

TTGACTGATCAGGACTCCTCAGTTCACCTTCTCACAATGAGGCTCCCTGCTCAGCTCCTGGGGCTGCTAATGCTCTGGGTCTCTGGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

ATCGAGTTCAGTGCAT_3 220 689.52 * IGKC(523):(83-219):(0-136):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

GACTGATCAGGACTCCTCAGTTCACCTTCTCACAATGAGGCTCCCTGCTCAGCTCCTGGGGCTGCTAATGCTCTGGGTCTCTGGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCC

CACCTTGTCCTTTACA_4 343 1783.30 IGKJ1*01(38):(64-101):(0-37):100.00 IGKC(523):(102-342):(0-240):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

GACTCTGTTCCCCTTTGGTGAGAAGGGTTTTTGTTCAGCAAGACAATGGAGAGCTCTCACTGTGGTGGACGTTCGGCCAAGGGACCAAGGTGGAAATCAAACGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

AGCAGCCCATCATCCC_5 324 474.12 * IGKC(523):(83-323):(0-240):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

GACTGATCAGGACTCCTCAGTTCACCTTCTCACAATGAGGCTCCCTGCTCAGCTCCTGGGGCTGCTAATGCTCTGGGTCTCTGGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

CTTTGCGAGAAGCCCA_6 397 181.63 IGKV2-28*01(302):(0-215):(79-294):93.06,IGKV2D-28*01(302):(0-215):(79-294):93.06 IGKJ5*01(38):(221-258):(0-37):94.74 IGKC(523):(259-396):(0-137):100.00 CDR1(0-0):0.00=null CDR2(83-91):100.00=TTGGGTTCT CDR3(197-230):83.33=TGCATACAAGGTCTACAAATTTCCGATCCCCTTC

AGAGCCTCCTAAATGTTAATCGATACAACTCTTTGGATTGGTACCTGCAGAAGCCAGGGCAGTCTCCACAGTTCCTGATCTATTTGGGTTCTAATCGGGCCTCCGGGGTCCCTGACAGGTTCAGTGGCAGTGGATCAGGCACAGAGTTCACACTGAAAATCAGCAGAGCGGAGGCTGAGGATGTTGGGATTTATTTGTGCATACAAGGTCTACAAATTTCCGATCCCCTTCGGCCAAGGGACACGACTGGAGACTAAACGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCC

CCGGTAGGTATCTGCA_7 326 753.13 * IGKC(523):(85-325):(0-240):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

TTGACTGATCAGGACTCCTCAGTTCACCTTCTCACAATGAGGCTCCCTGCTCAGCTCCTGGGGCTGCTAATGCTCTGGGTCTCTGGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

CACTCCACATGGTCTA_8 326 995.44 * IGKC(523):(85-325):(0-240):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

TTGACTGATCAGGACTCCTCAGTTCACCTTCTCACAATGAGGCTCCCTGCTCAGCTCCTGGGGCTGCTAATGCTCTGGGTCTCTGGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

TTCTCAAGTATGAAAC_9 326 377.11 * IGKC(523):(85-325):(0-240):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

TTGACTGATCAGGACTCCTCAGTTCACCTTCTCACAATGAGGCTCCCTGCTCAGCTCCTGGGGCTGCTAATGCTCTGGGTCTCTGGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

mourisl commented 1 year ago

Many of the contigs are from the IGKC genes, so they might be just expressed without recombination.

For the contig CTTTGCGAGAAGCCCA_6 , you can see all the V, D, J, C assignments and the complete CDR3 sequence in the header.
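For anyone who wants to pull these fields out of the header programmatically, here is a minimal Python sketch. It is not part of TRUST4. The regular expressions and the function name `parse_annot_header` are assumptions, and so is the reading of each gene entry as `name(reference_length):(contig_start-contig_end):(reference_start-reference_end):identity`, which is inferred from the example header in this thread (the contig is 397 bp long and the IGKC hit ends at contig position 396).

```python
import re

# gene(ref_len):(contig_start-contig_end):(ref_start-ref_end):identity
GENE_HIT = re.compile(
    r"([A-Za-z0-9*/\-]+)\((\d+)\):\((\d+)-(\d+)\):\((\d+)-(\d+)\):(\d+(?:\.\d+)?)"
)
# CDRn(start-end):score=sequence
CDR_FIELD = re.compile(r"(CDR[123])\((\d+)-(\d+)\):(\d+(?:\.\d+)?)=(\S+)")

def parse_annot_header(header):
    """Split one annot.fa header line into contig id, gene hits and CDRs."""
    text = header.lstrip(">").strip()
    gene_hits = [
        {
            "gene": m.group(1),
            "ref_length": int(m.group(2)),
            "contig_span": (int(m.group(3)), int(m.group(4))),
            "ref_span": (int(m.group(5)), int(m.group(6))),
            "identity": float(m.group(7)),
        }
        for m in GENE_HIT.finditer(text)
    ]
    cdrs = {
        m.group(1): {
            "span": (int(m.group(2)), int(m.group(3))),
            "score": float(m.group(4)),
            "sequence": None if m.group(5) == "null" else m.group(5),
        }
        for m in CDR_FIELD.finditer(text)
    }
    return {"contig_id": text.split()[0], "gene_hits": gene_hits, "cdrs": cdrs}

header = (
    "CTTTGCGAGAAGCCCA_6 397 181.63 "
    "IGKV2-28*01(302):(0-215):(79-294):93.06,"
    "IGKV2D-28*01(302):(0-215):(79-294):93.06 "
    "IGKJ5*01(38):(221-258):(0-37):94.74 "
    "IGKC(523):(259-396):(0-137):100.00 "
    "CDR1(0-0):0.00=null CDR2(83-91):100.00=TTGGGTTCT "
    "CDR3(197-230):83.33=TGCATACAAGGTCTACAAATTTCCGATCCCCTTC"
)
parsed = parse_annot_header(header)
print(parsed["contig_id"], [hit["gene"] for hit in parsed["gene_hits"]])
print(parsed["cdrs"]["CDR3"]["sequence"])
```

Because the sketch matches entries by pattern instead of by position, a `*` placeholder for a missing gene is simply skipped.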

saramoein372 commented 1 year ago

Hi Li,

Thank you so much for your reply.

I tried to parse my trust_annot.fa files to see which of the assembled reads have VDJ recombinations. After that, I could see that for my malignant cells, some of them have different VDJs. How and why can this happen?

Do you have any idea?

Thank you!

mourisl commented 1 year ago

Are those VDJs with complete CDR3s? I'm curious why you want to explore the sequences in the annot.fa file; the results in the trust_barcode_report.tsv or trust_barcode_airr.tsv files are cleaner.

saramoein372 commented 1 year ago

Thanks Li. You asked: "I'm curious why you want to explore the sequences in the annot.fa file, the results in trust_barcode_report.tsv or trust_barcode_airr.tsv files are cleaner". I need a file that has the largest number of assembled reads with VDJ annotations. The issue we have is that the cells in the Bclone have different VDJ assignments, and I am trying to see whether each cell has other assembled reads whose VDJs match what we have in the Bclone.

Thanks for the comment about checking the CDR3 score to see whether a CDR3 is partial or full. Just to confirm, the column names for trust4_cdr3.out are:

consensus_id index_within_consensus V_gene D_gene J_gene C_gene CDR1 CDR2 CDR3 CDR3_score read_fragment_count CDR3_germline_similarity full_length_assembly

Correct?

mourisl commented 1 year ago

Yes, this is the current _cdr3.out file format.

For the different VDJ assignments, if the percentage of cells with other VDJ is quite low, it is more likely due to sequencing artifacts, such as contamination.

saramoein372 commented 1 year ago

Thanks, Li. You wrote: "For the different VDJ assignments, if the percentage of cells with other VDJ is quite low, it is more likely due to sequencing artifacts, such as contamination."

From which file and which column can I obtain the percentage?

mourisl commented 1 year ago

It's not directly in TRUST4's results. Since you are concerned that there are multiple VDJs in the malignant cell population, I'm wondering whether you expect them to have the same VDJ. If so, there should be a dominant clonotype in this population, and if the fraction of cells with the dominant clonotype is really high, then you don't need to worry. If you see a more even distribution, it's possible that there are multiple tumor clones in the sample, depending on the disease type.
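Since the percentage is not reported directly, it has to be computed from a per-cell table. A minimal Python sketch follows, assuming you have already built a mapping from cell barcode to a clonotype label (for example the CDR3 of the representative chain). The function name `dominant_clonotype` and the example labels are hypothetical.

```python
from collections import Counter

def dominant_clonotype(clonotype_per_cell):
    """Return the most common clonotype and the fraction of cells carrying it."""
    if not clonotype_per_cell:
        raise ValueError("no cells given")
    counts = Counter(clonotype_per_cell.values())
    clonotype, n_cells = counts.most_common(1)[0]
    return clonotype, n_cells / len(clonotype_per_cell)

# Made-up example: 9 cells share one clonotype, 1 cell carries another.
cells = {f"cell{i}": "CDR3_A" for i in range(9)}
cells["cell9"] = "CDR3_B"
top, fraction = dominant_clonotype(cells)
print(top, fraction)
```

A high fraction points to one dominant clonotype with a few outlier cells. A flatter distribution is the case where multiple clones become plausible.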

saramoein372 commented 1 year ago

I see a dominant VDJ in my clone. Does that seem fine to you? And how can we justify that some of the cells in the clone (a low percentage) have a different VDJ?

saramoein372 commented 1 year ago

Hi Li,

I have a few questions:

1- The file trust4_annot.fa is composed of the assembled reads. Correct?
2- The VDJ assignments are based on the sequences in trust4_annot.fa. Correct?
3- In trust4_annot.fa, I could see that some of the VDJ assignments differ from what I get when I put the sequence into the igblast tool. Could this be because of the version of IMGT that trust4 is using? How can I justify this?

Thank you, Sara

mourisl commented 1 year ago

I see a more dominant vdj in my clone. Does that seem fine to you? And how we can justify that (low percentage ) some of the cells in clone have different vdj?

There is no consensus on that. I think if the dominant VDJ is above 90%, the population is quite homogeneous. Note that the malignant cell cluster might not consist entirely of malignant cells because of clustering artifacts, so allowing some differences is quite normal.

1- The file trust4_annot.fa is composed of the assembled reads. Correct?

They are assembled contigs, not the reads.

2- The VDJ assignments are based on the sequences in trust4_annot.fa. Correct?

Right

3- In trust4_annot.fa VDJs, I could see some of the VDJ assignments are different from when I put the sequence in igblast tool. Can it be because of the version of the IMGT that trust4 is using? How can I justify this?

This could also be algorithmic differences, such as different scores on mismatches or indels.

saramoein372 commented 1 year ago

Thanks Li.

Can I ask which code or tutorial is available that explains the strategy for finding the VDJ assignment?

saramoein372 commented 1 year ago

Li,

Basically, I need to know how the genes are scored when there are 3 significant V genes and 3 significant J genes. For example, for my sequence I get 3 significant V genes: V1401, V1402 and V1403; and 3 significant J genes: J101, J201 and J301.

But in the final trust4 output I see that the assignment is V gene: V1401 and J gene: J101.

How is this one selected? Thanks

mourisl commented 1 year ago

In the annot file, each gene assignment has an alignment identity (the percentage of matched bases), and I pick the one with the highest identity for downstream representation. If there is a tie, TRUST4 selects the gene used most frequently across the data set.
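The selection rule just described (highest identity first, then most frequent gene across the data set on a tie) can be sketched in Python as below. This is an illustration of the rule as stated in this thread, not TRUST4's actual code. The function name `pick_gene` and the usage counts are made up, and the identities are the ones from the GCGACCAGTAGCTCCG_60392 example in this thread.

```python
def pick_gene(hits, usage):
    """hits: list of (gene_name, identity); usage: gene_name -> count in data set.

    Keep the hits with the highest identity, then break ties by how often
    each gene is used across the data set (first listed wins if still tied).
    """
    best_identity = max(identity for _, identity in hits)
    tied = [name for name, identity in hits if identity == best_identity]
    return max(tied, key=lambda name: usage.get(name, 0))

hits = [
    ("IGLV2-34*01", 98.28),
    ("IGLV2-NL1*01", 98.28),
    ("IGLV2-14*01", 96.61),
]
# Hypothetical usage counts, for illustration only.
usage = {"IGLV2-34*01": 12, "IGLV2-NL1*01": 1, "IGLV2-14*01": 40}
print(pick_gene(hits, usage))
```

Note that a gene with lower identity is never chosen under this rule, however frequent it is. The usage count only matters among the hits tied at the top identity.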

saramoein372 commented 1 year ago

Hi Li,

Thank you so much! I am still not clear on how to read the scores. Would you please provide more details using the example below from the annot file (where we can see the matched percentages)?

CTTTGCGAGAAGCCCA_6 397 181.63 IGKV2-28*01(302):(0-215):(79-294):93.06,IGKV2D-28*01(302):(0-215):(79-294):93.06 IGKJ5*01(38):(221-258):(0-37):94.74 IGKC(523):(259-396):(0-137):100.00 CDR1(0-0):0.00=null CDR2(83-91):100.00=TTGGGTTCT CDR3(197-230):83.33=TGCATACAAGGTCTACAAATTTCCGATCCCCTTC

AGAGCCTCCTAAATGTTAATCGATACAACTCTTTGGATTGGTACCTGCAGAAGCCAGGGCAGTCTCCACAGTTCCTGATCTATTTGGGTTCTAATCGGGCCTCCGGGGTCCCTGACAGGTTCAGTGGCAGTGGATCAGGCACAGAGTTCACACTGAAAATCAGCAGAGCGGAGGCTGAGGATGTTGGGATTTATTTGTGCATACAAGGTCTACAAATTTCCGATCCCCTTCGGCCAAGGGACACGACTGGAGACTAAACGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCC

saramoein372 commented 1 year ago

In the example below,

GCGACCAGTAGCTCCG_60392 90 0.18 IGLV2-34*01(297):(0-57):(224-281):98.28,IGLV2-NL1*01(297):(0-57):(224-281):98.28,IGLV2-14*01(297):(0-58):(224-282):96.61 IGLJ2*01(38):(76-89):(7-20):100.00,IGLJ3*01(38):(76-89):(7-20):100.00 * CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(43-78):100.00=TGCAGCTCATATGCAACACGTAACACTGTCCTCTTC

GACCATCTCTGGGCTCCAGCCTGAGGACGAGGCTGATTATTACTGCAGCTCATATGCAACACGTAACACTGTCCTCTTCGGCGGAGGGAC

I expect this cell to carry IGLV2-14*01 as the V gene, but the trust4 tool selected IGLV2-34*01. I can see IGLV2-14*01 in the list of significant genes, but I need a way to justify IGLV2-14*01 as the V gene. Do you have any comments?

mourisl commented 1 year ago

The recovered V gene in this sequence is fairly short, so it could be misassigned to other V genes, especially with somatic hypermutations and variations within the CDR3 joining site. I think one way to alleviate this is to ignore the V gene assignment altogether and just use the CDR3 as the anchor.

In this case, IgBlast identified IGLV2-14 as the V gene while TRUST4 selected IGLV2-34. But in other cases, it could be the other way around.

saramoein372 commented 1 year ago

Hi Li,

Thank you so much.

I have some questions about the trust4_annot.fa file:

1- This file contains the assembled reads with VDJ assignments. Correct?
2- How are these reads assembled? Is any filtering applied to obtain this file?
3- Is it correct that some of the raw reads are not reported in trust_annot.fa? If so, can we conclude that the assembled reads in this file are built from good reads?
4- In this file, each cell has multiple assembled reads. How does trust4 report the best assembled read among them in trust_annot.fa? I remember you wrote before about a score, but I cannot see any score in the trust_annot.fa file. Can you explain the strategy for assigning the correct VDJ? I can see that, for a cell barcode, each assembled read has up to 3 V and up to 3 J genes. How are the correct V gene and the correct J gene selected among these?
5- Among multiple assembled reads, how is the best assembled read selected?

Thank you, Sara

mourisl commented 1 year ago

1- This file is the consensus of the assembled contigs, not the reads themselves. The reads used in assembling these contigs are in the trust_assembled_reads.fa file.
2- They are assembled using the overlap-and-extension scheme. Essentially, it looks for large overlaps between two reads and merges them into a contig. You can refer to the TRUST4 paper.
3- The sequences in the annotation file are of high quality.
4, 5- TRUST4 realigns the reads to the assembled contigs to calculate the abundance of each CDR3. It then selects the most abundant CDR3 (supported by the largest number of reads or UMIs) as the representative chain for this cell. For the V and J gene assignment, TRUST4 selects the gene with the least discrepancy to the contig. For the cell, TRUST4 uses the results from the contig corresponding to the representative CDR3.
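To make the overlap-and-extension idea mentioned above concrete, here is a toy Python sketch that merges two reads when the end of one exactly matches the start of the other. It is a simplification for illustration only. The function name `merge_by_overlap`, the `min_overlap` parameter and the exact-match requirement are assumptions, and the TRUST4 paper describes the real procedure.

```python
def merge_by_overlap(left, right, min_overlap=5):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first and returns the merged
    sequence, or None when no overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]  # extend `left` with the new bases
    return None

# Made-up reads: the last 5 bases of the first equal the first 5 of the second.
print(merge_by_overlap("ACGTACGTTT", "CGTTTGGA", min_overlap=4))
print(merge_by_overlap("AAAA", "CCCC", min_overlap=2))
```

Requiring a large minimum overlap is what keeps unrelated reads from being joined by chance.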

Hope this helps.

saramoein372 commented 1 year ago

Li, thank you so much for the explanation. That really helped. You wrote: "This file is the consensus of the assembled contig, not the reads themselves. The reads used in assembling these contigs are in the trust_assembled_reads.fa file." I have two questions:
1- Would you please give your definition of the "consensus of the assembled contig"?
2- How is the consensus of the assembled contig generated?

Also, I have two more questions:
3- Is there any limit on the number of contig consensuses?
4- I thought we can't have more than 3 or 4 contigs in our final data. Is this correct?

Thanks, Sara

mourisl commented 1 year ago

1,2- It's a bit technical. Essentially, it uses the nucleotide supported by the largest number of reads at each position to accommodate sequencing errors and SHMs. I would recommend reading the TRUST4 manuscript.

3- Consensus is for one contig.
4- We can have multiple contigs for each cell. This could arise from sequencing artifacts (too many sequencing errors, doublets, contamination) or algorithmic artifacts.
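The per-position majority vote described in the answer to questions 1 and 2 can be sketched in Python as follows. This is a simplified illustration, not TRUST4's implementation. It assumes each read has already been placed on the contig at a known offset, that there are no indels, and that every contig position is covered by at least one read. The function name `consensus` and the reads are made up.

```python
from collections import Counter

def consensus(placed_reads):
    """placed_reads: iterable of (offset_on_contig, read_sequence).

    At each contig position, keep the nucleotide seen in the most reads.
    """
    columns = {}
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns.setdefault(offset + i, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))

# Made-up reads: the second read carries a single mismatch (A instead of T)
# at contig position 3, which is outvoted by the other three reads.
reads = [(0, "ACGT"), (0, "ACGA"), (1, "CGTT"), (2, "GTTA")]
print(consensus(reads))
```

A rare sequencing error is outvoted in this scheme, while a true variant present in most reads of a cell, such as a somatic hypermutation, is kept.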

saramoein372 commented 1 year ago

Thank you so much, Li!

I have read the manuscript, but it seems I still need your support. Appreciate it!

Also, two more questions:

1- For the VDJ assignments, you explained earlier that some scores are used. In which file can I find those scores? Here is an example from trust_annot.fa (maybe it would help if you explained the scores):

CTCGGGAGTCATACTG_11 480 3975.58 IGKV3D-11*02(287):(0-203):(80-283):97.06,IGKV3-11*01(287):(0-202):(80-282):97.04 IGKJ3*01(38):(204-238):(3-37):100.00,IGKJ2*01(39):(203-238):(3-38):83.33 IGKC(523):(239-479):(0-240):100.00 CDR1(0-0):0.00=null CDR2(67-75):100.00=GATGCATCC CDR3(181-210):83.33=TGTCAGCAGCGTAGCGACTGGCACACTTTC
GAGTGTTAGCAGCTACTTAGCCTGGTACCAGCAGAAACCTGGCCAGGCTCCCAGGCTCCTCATCTATGATGCATCCAACAGGGCCACTGGCATCCCAGCCAGGTTCAGTGGCAGTGGGTCTGGGACGGACTTCACTCTCACCATCAACAGCCTAGAGCCTGAAGATTTTGCAGTCTATTTCTGTCAGCAGCGTAGCGACTGGCACACTTTCGGCCCTGGGACCAAAGTGGATATCAAACGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGACAGCACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTACGAGAA

2- I now need to run TRUST4 on my TCR data. I am using the command below for my BCR data, but I am not sure how I should modify it to run on TCR data.

run-trust4 -f hg38_bcrtcr.fa -t 8 --ref human_IMGT+C.fa -u HL6_s1_R2.fastq.gz --barcode HL6_s1_R1.fastq.gz --barcodeRange 0 15 + --barcodeWhitelist HL1_GEX_BCR_TCR/737K-august-2016_barcodes.txt --UMI HL6_s1_R1.fastq.gz --umiRange 16 27 + -o FR2_s1 --od out_s1 --repseq

Thank you so much!

mourisl commented 1 year ago

1: For example, in IGKV3D-11*02(287):(0-203):(80-283):97.06, the final value, 97.06, is the score.
2: There is no need to change anything. The command line for BCR-seq and TCR-seq is the same.
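
For readers who want to pull the score out programmatically, the sketch below parses one gene field of a trust_annot.fa header. The field meanings used here (reference gene length, contig range, reference range, then score) are my reading of the TRUST4 README and should be treated as an assumption. `GeneHit`, `parse_hits` and `best_hit` are hypothetical helper names, not part of TRUST4.

```python
import re
from typing import NamedTuple


class GeneHit(NamedTuple):
    gene: str
    ref_length: int      # assumed: length of the reference gene
    contig_start: int    # assumed: aligned range on the contig
    contig_end: int
    ref_start: int       # assumed: aligned range on the reference gene
    ref_end: int
    score: float         # the value Li calls "the score"


_HIT = re.compile(
    r"^(?P<gene>[^()]+)\((?P<ref_length>\d+)\):"
    r"\((?P<c_start>\d+)-(?P<c_end>\d+)\):"
    r"\((?P<r_start>\d+)-(?P<r_end>\d+)\):"
    r"(?P<score>\d+(?:\.\d+)?)$"
)


def parse_hits(field):
    """Parse a comma-separated list of gene hits from one header field."""
    hits = []
    for item in field.split(","):
        m = _HIT.match(item)
        if m is None:
            raise ValueError(f"unrecognised gene hit: {item!r}")
        hits.append(GeneHit(
            m["gene"], int(m["ref_length"]),
            int(m["c_start"]), int(m["c_end"]),
            int(m["r_start"]), int(m["r_end"]),
            float(m["score"]),
        ))
    return hits


def best_hit(field):
    """Return the hit with the highest score in the field."""
    return max(parse_hits(field), key=lambda h: h.score)


v_field = ("IGKV3D-11*02(287):(0-203):(80-283):97.06,"
           "IGKV3-11*01(287):(0-202):(80-282):97.04")
print(best_hit(v_field).gene, best_hit(v_field).score)
```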

saramoein372 commented 1 year ago

Thank you so much, Li!

In your previous email, you wrote: "TRUST4 will realign the reads to the assembled contig again to calculate the abundances for each CDR3. It then selects the most abundant CDR3 (supported by the most number of reads or UMIs) as the representative chain for this cell." When you say "reads", you mean the assembled reads from the trust_assembled_reads.fa file, correct?

mourisl commented 1 year ago

Right.

saramoein372 commented 1 year ago

Thank you Li.

Just related to TCR: how should I filter my TCR results? Is any entry in CDR3.out that is assigned to a TR* gene a TCR?

Sorry if my questions are not that accurate. My background is computer science.

Thanks, Sara

saramoein372 commented 1 year ago

Sorry, please ignore my last email Li. Thank you!

saramoein372 commented 1 year ago

Hi Li,

I have some questions again:

In the file trust4_cdr3.out I have multiple contigs for one cell barcode:

1- Why do we have multiple contigs per cell barcode in trust4_cdr3.out?
2- trust4_cdr3.out is the input to my clustering code, and the output is the clone IDs in clone.out. How are the contigs in trust4_cdr3.out filtered so that we get a smaller number of contigs in clone.out?
3- For one of my samples, there are more than five contigs per cell in the clone.out file, and they have different VJ assignments within one cell. How is this possible, and how do we justify it?

Thanks, Sara

mourisl commented 1 year ago

1. The contigs could be for heavy/light or beta/alpha chains. The trust4_cdr3.out file also contains partial CDR3s, which could arise from non-VDJ-recombined transcripts or other artifacts. Therefore, it is expected to have multiple contigs per cell barcode.
2. trust4_cdr3.out lists the contigs with some CDR3 information. I'm not sure how your clone.out file was generated.
3. It could be that the samples are overloaded with too many cells, so many barcodes become doublets. There could be other reasons, which I can't be sure about without looking into the data.

saramoein372 commented 1 year ago

Thanks Li. Regarding the clone.out file, it is generated by the trust-cluster.py code on GitHub. So I wanted to know how, from many partial CDR3s and many contigs, the trust-cluster.py code selects some of them.

Regarding the third question, where the trust-cluster.py result contains many contigs for each cell: if they are doublets, or for whatever reason, how can I filter the contigs? I have shared my trust-cluster.py result here. Not sure if you can see it on GitHub. I need to know how to filter this file, as there are many contigs per cell.

thanks!

mourisl commented 1 year ago

1. trust-cluster.py has two filters: one removes partial CDR3s, and the other removes contigs that do not have both V and J gene annotations.
2. The downstream filter might be application dependent. trust-cluster.py will also cluster CDR3s across multiple cells, so I'm not sure how you are interpreting the results.
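
As a rough illustration of the two filters, here is a sketch that applies them to already-parsed records. This is not the code in trust-cluster.py. The dictionary keys (`v_gene`, `j_gene`, `cdr3_score`), the idea that a missing gene shows up as `*`, `.` or an empty string, and the idea that a partial CDR3 is marked by a score of 0 are all assumptions made for the example. Check them against your own trust4_cdr3.out before relying on them.

```python
MISSING = {"", "*", "."}  # assumed placeholders for "no gene assigned"


def keep_contig(record):
    """Return True if a record passes both (assumed) filters.

    record: dict with hypothetical keys "v_gene", "j_gene", "cdr3_score".
    """
    has_v = record["v_gene"] not in MISSING
    has_j = record["j_gene"] not in MISSING
    # Assumption: a CDR3 score of 0 marks a partial CDR3.
    complete_cdr3 = float(record["cdr3_score"]) > 0
    return has_v and has_j and complete_cdr3


records = [
    {"v_gene": "IGKV3D-11*02", "j_gene": "IGKJ3*01", "cdr3_score": "1.00"},
    {"v_gene": "*", "j_gene": "IGKJ3*01", "cdr3_score": "1.00"},
    {"v_gene": "IGHV3-23*01", "j_gene": "IGHJ4*02", "cdr3_score": "0.00"},
]
kept = [r for r in records if keep_contig(r)]
print(len(kept), "of", len(records), "records kept")
```

Anything beyond this, such as keeping only one heavy and one light chain per barcode, is the application-dependent downstream filtering mentioned above.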

saramoein372 commented 1 year ago

Thanks Li. One question I have from your paper: In supplement, figure 2, what is the definition of precision and the definition of # of recall?

mourisl commented 1 year ago

Precision: the fraction of V-CDR3-J-C called from RNA-seq that are matched with the iRepertoire BCR-seq. The V and J genes ignore the allele information (the information after *).

# of recall: the number of V-CDR3-J-Cs found in iRepertoire BCR-seq that are found in the RNA-seq-based results.
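
The two definitions can be written down as a small sketch. It treats every distinct V-CDR3-J-C combination as one unit, which is an assumption on my part, since the comment does not say whether calls are weighted by abundance. The function names and the tuple layout `(V, CDR3, J, C)` are hypothetical.

```python
def strip_allele(gene):
    """Drop the allele suffix, i.e. everything from '*' onward."""
    return gene.split("*", 1)[0]


def call_key(v, cdr3, j, c):
    """Comparison key for one V-CDR3-J-C call, ignoring V and J alleles."""
    return (strip_allele(v), cdr3, strip_allele(j), c)


def precision_and_recall_count(rna_calls, bcr_seq_calls):
    """Return (precision, number of recalled calls).

    rna_calls: calls made from RNA-seq.
    bcr_seq_calls: calls from the BCR-seq reference (iRepertoire here).
    Both are iterables of (V, CDR3, J, C) tuples.
    """
    rna = {call_key(*call) for call in rna_calls}
    ref = {call_key(*call) for call in bcr_seq_calls}
    matched = rna & ref
    precision = len(matched) / len(rna) if rna else 0.0
    return precision, len(matched)


rna = [
    ("IGHV3-23*01", "TGTGCGAAAGATTGG", "IGHJ4*02", "IGHG1"),
    ("IGHV1-69*01", "TGTGCGAGAGGGTGG", "IGHJ6*02", "IGHM"),
]
ref = [
    ("IGHV3-23*04", "TGTGCGAAAGATTGG", "IGHJ4*01", "IGHG1"),
    ("IGHV4-34*01", "TGTGCGAGAGGCTGG", "IGHJ5*02", "IGHA1"),
]
print(precision_and_recall_count(rna, ref))
```

In this made-up data the first RNA-seq call matches the reference once alleles are ignored, so precision is 1 of 2 and the "# of recall" is 1.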

saramoein372 commented 1 year ago

Thanks. And in the Fig.2. (b) in supplement, what is the "BCR copies" ?

mourisl commented 1 year ago

"BCR copies" is the abundance value reported by iRepertoire, such as BCR mRNA copies.

saramoein372 commented 1 year ago

Thank you Li.

I have a question:

1- I have 10X sc-BCR data, single-end reads. For the "candidate reads selection" step, the first part of the algorithm, would you please explain the sentence "The threshold is maximum(21, read_length/5 + 1), so data with shorter reads have less stringent criteria"? What does this step do?

All my reads are BCR reads, so I expect the algorithm not to filter any of them. How does this work?

2- Do I need to change maximum(21, read_length/5 + 1) to avoid the filtering? All my data are BCR.

Thanks, Sara

mourisl commented 1 year ago

This is the step used when extracting candidate reads. The length is the number of bases covered by the concordant k-mer hits. The chain is constructed by the longest increasing subsequence algorithm.

If all the reads are from BCR, then all of them should be retained. Some reads could still be filtered, such as those with poly-A tails or other sequencing artifacts.

If you don't want the filter at all, you can run TRUST4 with the option "--noExtraction".
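
The threshold and the chaining step can be sketched as follows. This is an illustration of the idea, not TRUST4's code. The k-mer size, the integer division in the threshold, the `>=` comparison, and the reading of "concordant" as "increasing in both read and reference coordinates" are all assumptions, and the function names are hypothetical.

```python
from bisect import bisect_left


def overlap_threshold(read_length):
    """maximum(21, read_length/5 + 1), using integer division (assumed)."""
    return max(21, read_length // 5 + 1)


def longest_increasing_chain(hits):
    """Longest chain of k-mer hits increasing in both coordinates.

    hits: list of (read_pos, ref_pos) pairs, one per k-mer hit.
    """
    # Sort by read position; ties broken by descending ref position so
    # two hits at the same read position can never both enter the chain.
    ordered = sorted(hits, key=lambda h: (h[0], -h[1]))
    tails, tail_refs = [], []          # best chain ends, per chain length
    prev = [-1] * len(ordered)         # back-pointers to rebuild the chain
    for idx, (_, ref_pos) in enumerate(ordered):
        slot = bisect_left(tail_refs, ref_pos)
        if slot > 0:
            prev[idx] = tails[slot - 1]
        if slot == len(tails):
            tails.append(idx)
            tail_refs.append(ref_pos)
        else:
            tails[slot] = idx
            tail_refs[slot] = ref_pos
    chain, idx = [], tails[-1] if tails else -1
    while idx != -1:
        chain.append(ordered[idx])
        idx = prev[idx]
    return chain[::-1]


def covered_bases(chain, k):
    """Number of distinct read bases covered by the k-mers in the chain."""
    covered = set()
    for read_pos, _ in chain:
        covered.update(range(read_pos, read_pos + k))
    return len(covered)


def is_candidate(hits, read_length, k):
    chain = longest_increasing_chain(hits)
    return covered_bases(chain, k) >= overlap_threshold(read_length)


hits = [(0, 100), (1, 101), (2, 102), (3, 50), (10, 110)]
print(overlap_threshold(150), is_candidate(hits, read_length=50, k=9))
```

The hit `(3, 50)` jumps backwards on the reference, so it is left out of the chain. The remaining four hits cover 19 read bases with k = 9, which falls short of the threshold of 21 for a 50 bp read.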

saramoein372 commented 1 year ago

Thank you Li! So, if I understand correctly, when my raw data is only BCR, TRUST4 will pass all of the reads. Correct? And the result without --noExtraction should be okay. Right?

Also, I wanted to ask for some details about TRUST4's strategy when there is somatic hypermutation in scRNA-seq data. It would be great if you could describe all the steps for capturing the reads when somatic hypermutation is present.

Thank you, Sara

mourisl commented 1 year ago

You can run TRUST4 without --noExtraction. It is safer to do so in order to filter some noisy reads.

For SHM, do you mean you want to get the reads containing SHMs or do you think TRUST4 would miss those reads? TRUST4 allows SHM up to 20% of the read, so it should capture those SHMs.

saramoein372 commented 1 year ago

Thanks. "For SHM, do you mean you want to get the reads containing SHMs or do you think TRUST4 would miss those reads? TRUST4 allows SHM up to 20% of the read, so it should capture those SHMs."

I just need to know TRUST4's strategy for SHM when we have sc-BCR data. I would appreciate it if you could describe the steps for handling SHM in scBCR data.

Thanks.

liulab-dfci / TRUST4

how the VDJs are assigned to contigs #169