liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Filtering method #163

Open saramoein372 opened 2 years ago

saramoein372 commented 2 years ago

Hello,

I have a question about the filtering steps performed by TRUST4. I read the paper, but I wanted to know whether there are additional criteria for filtering reads/cells that are not described in the paper.

The reason I am asking is that we want to know why some of our reads that are captured by cellranger are not captured by TRUST4.

Thanks, Sara

mourisl commented 2 years ago

Did you run TRUST4 on the 10X TCR/BCR-kit data or the GEX data?

saramoein372 commented 2 years ago

No, it was different data. Is the TRUST4 run on the 10X TCR/BCR-kit data for a healthy sample?

mourisl commented 2 years ago

Do you mean you ran cellranger on the GEX data and it found some CDR3s that TRUST4 missed?

saramoein372 commented 2 years ago

We ran cellranger on BCR data and then tried the same data with TRUST4, and we see that TRUST4 captured more cells. But some of the cells that cellranger captured are not among the cells TRUST4 captured.

mourisl commented 2 years ago

TRUST4 has very few filters. I guess this could come from algorithmic differences. One difference is that 10x has modified the IMGT gene annotation a little bit, and I'm not sure how much this contributes to the assembly difference.

saramoein372 commented 2 years ago

Thanks. Are there any results from running TRUST4 on BCR data from a healthy sample? Or do you have any fastq (or bam) files that I could run TRUST4 on?

mourisl commented 2 years ago

Sorry, I did not test TRUST4 on BCR-seq much, and most results were based on RNA-seq data.

mourisl commented 2 years ago

Do you mean 10x single-cell BCR or bulk BCR-seq data?

saramoein372 commented 2 years ago

10x single cell bcr data

mourisl commented 2 years ago

I think the immune profiling 10K-cell PBMC data on the 10x website is from a healthy donor.

saramoein372 commented 2 years ago

Thank you so much Li! One more question: how can I assign the contig_ids to the cells in the FR2_cdr3.out?

I can see the trust_final_out, but I am not sure how to relate the cluster_ids and cdr3 files to the contigs.

I appreciate your support!

Best, Sara

mourisl commented 2 years ago

In your TRUST4_barcode_report.tsv file, in each chain's comma-separated information, the eighth column is the "consensus_id", and this connects to the contig_ids.
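
A minimal Python sketch of that lookup. It assumes the report is tab-separated with the barcode in the first column and chain1/chain2 in the third and fourth columns, and that a missing chain appears as `*`, an empty string, or `NA`; these are assumptions to check against your own file and README.md. The helper names and the example rows are made up for illustration.

```python
import csv
import io

MISSING = {"", "*", "NA"}  # assumed placeholders for "no chain detected"

def consensus_id(chain_field):
    """Return the 8th comma-separated item of one chain field, or None."""
    if chain_field.strip() in MISSING:
        return None
    parts = chain_field.split(",")
    return parts[7] if len(parts) >= 8 else None

def consensus_ids(report_text, chain_columns=(2, 3)):
    """Map each barcode to the consensus ids found in its chain columns."""
    result = {}
    for row in csv.reader(io.StringIO(report_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip the header line
        result[row[0]] = [
            consensus_id(row[i]) if i < len(row) else None for i in chain_columns
        ]
    return result

# Made-up rows for illustration only, not real TRUST4 output.
example = (
    "#barcode\tcell_type\tchain1\tchain2\n"
    "AAAC-1\tB\tIGHV1,*,IGHJ4,IGHM,TGTGCG,CA,12,AAAC-1_3,98.5,1"
    "\tIGKV1,*,IGKJ1,IGKC,TGTCAG,CQ,30,AAAC-1_0,99.0,1\n"
    "AAAG-1\tB\tIGHV3,*,IGHJ6,IGHG1,TGTGCA,CA,5,AAAG-1_1,97.0,0\t*\n"
)
print(consensus_ids(example))
```

A `None` entry marks a cell for which only one chain was reported.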

saramoein372 commented 2 years ago

Thanks Li.

So is it true to say TRUST4_barcode_report.tsv contains all the contigs after removal of all cells with low quality? And all contigs in TRUST4_barcode_report.tsv are the filtered contigs that I can use for downstream analysis?

mourisl commented 2 years ago

Not all the contigs. Even from the high-quality cells, there could be fragmented assemblies and other artifacts. Those will be present in the _final.out and _annot.fa files, but they will not show up in the barcode_report.tsv file. So the contigs in TRUST4_barcode_report.tsv are the ones to use for downstream analysis.

If you are interested in the sequence outside of the CDR3 region, you can also use the barcode_airr.tsv following the AIRR format specification.

saramoein372 commented 2 years ago

Thanks Li.

Can I ask how the TRUST4_barcode_report.tsv is generated? How are the cells filtered? I ask because it has far fewer rows than TRUST4_cdr3.out.

mourisl commented 2 years ago

TRUST4_barcode_report.tsv selects one representative pair of CDR3s for each cell. The number of rows in the barcode_report.tsv file is the number of detected cells, and the number of rows in TRUST4_cdr3.out is the number of detected CDR3s, which includes partial CDR3s and also many artifacts. Therefore, barcode_report.tsv has far fewer rows than the cdr3.out file. Hope this is clear.

saramoein372 commented 2 years ago

Thank you Li. In TRUST4_barcode_report.tsv, the first column is the cell barcode, and the 8th column in "chain" shows the contigs. Why are some of the values in the 8th column of "chain" NA?

mourisl commented 2 years ago

This shouldn't happen. Could you please show me an example?

saramoein372 commented 2 years ago

In my data, I see this (the second row):

[image: Screen Shot 2022-11-03 at 3.18.54 PM.png]

mourisl commented 2 years ago

Sorry, GitHub could not show the image.

saramoein372 commented 2 years ago

In my data, I see this (the second row):

I have some rows where column 1 has the cell barcode, but the 8th column in chain2 is empty.

mourisl commented 2 years ago

This means TRUST4 only detected one chain for this cell.

saramoein372 commented 2 years ago

Got it. Thank you.

saramoein372 commented 2 years ago

Hi Li,

I have a question about the definition of each of the output files. What is:

1- FR2_assembled_reads.fa
2- FR2_raw.out

I am looking for a file that contains all the raw reads with VDJ assignments. Is "FR2_raw.out" the file I need? Thanks, Sara

mourisl commented 2 years ago

FR2_raw.out is the raw assembled contig file. FR2_assembled_reads.fa holds the reads used in the assembly, so it is probably the file you want to use. But I think the FR2_assembled_reads.fa file does not contain the reads from the 3' end of the C gene.

If you need more detailed read assignment information, you can add the option "--outputReadAssignment". It will create a file "XXX_align.tsv" where the first column is the read id and the second column is the id of the contig the read is assigned to.
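
As a rough sketch of how such a two-column file could be used, the Python snippet below groups read ids by contig id. It assumes the columns are tab-separated (suggested by the .tsv extension but not confirmed here); the function name and the example lines are hypothetical.

```python
from collections import defaultdict

def reads_per_contig(lines):
    """Group read ids (column 1) by the contig id they were assigned to (column 2)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        read_id, contig_id = fields[0], fields[1]
        groups[contig_id].append(read_id)
    return dict(groups)

# Made-up lines for illustration only.
example = ["read1\tcontigA\n", "read2\tcontigA\n", "read3\tcontigB\n"]
print(reads_per_contig(example))
```

With a real file, the same function can be called on an open file handle, for instance `reads_per_contig(open("XXX_align.tsv"))`.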

saramoein372 commented 2 years ago

Li, thank you so much for the fast reply.

What is FR2_cdr3.out? And FR2_report.tsv?

mourisl commented 2 years ago

FR2_cdr3.out is the CDR3 information for each assembled consensus contig, and the report.tsv file is a further simplified representation of the CDR3s, made by coalescing identical terms in the cdr3.out file. In other words, FR2_cdr3.out is a contig-driven CDR3 file, and FR2_report.tsv is a CDR3-driven CDR3 file.
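
To illustrate the contig-driven versus CDR3-driven distinction, here is a toy Python sketch that collapses per-contig records into one record per CDR3 sequence. It is not TRUST4's actual implementation, which may also take gene calls or other fields into account; the tuple layout, function name, and data are invented for the example.

```python
from collections import defaultdict

def coalesce_by_cdr3(contig_rows):
    """Collapse (contig_id, cdr3, read_count) rows into one row per CDR3."""
    totals = defaultdict(float)
    contigs = defaultdict(list)
    for contig_id, cdr3, count in contig_rows:
        totals[cdr3] += count          # sum support across contigs
        contigs[cdr3].append(contig_id)  # remember which contigs carried it
    rows = [(cdr3, totals[cdr3], contigs[cdr3]) for cdr3 in totals]
    return sorted(rows, key=lambda row: -row[1])  # most supported first

# Made-up contig-level records for illustration only.
contig_rows = [
    ("c1", "TGTGCG", 3.0),
    ("c2", "TGTGCG", 2.0),
    ("c3", "TGTCAG", 4.0),
]
print(coalesce_by_cdr3(contig_rows))
```

Three contig-level rows become two CDR3-level rows, which is the sense in which the report file is the more compact view.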

saramoein372 commented 2 years ago

How can I get the VDJ assignments in FR2_raw.out?

mourisl commented 2 years ago

FR2_raw.out will be further assembled with the mate-pair information into the FR2_final.out file. The VDJ assignment for each entry in the FR2_final.out will be in the FR2_annot.fa file. Hope this is clear.

saramoein372 commented 2 years ago

Hi Li,

I am a little confused by the many different files I have available. Can I ask which file contains the raw data in fasta format?

Thanks, Sara

mourisl commented 2 years ago

Sorry, I don't quite get the question. What do you mean by the raw data?

saramoein372 commented 2 years ago

I mean the raw reads, before being assembled.

mourisl commented 2 years ago

They are in the trust_assembled_reads.fa file. This file does not have the VDJ annotation. If you want to add that, you can run the program "annotator" in TRUST4, with extra options like "--fasta --needReverseComplement".

saramoein372 commented 2 years ago

Thanks Li. Is there any example of running annotator?

mourisl commented 2 years ago

If you keep the running log of TRUST4, you can find the "annotator" command there.

saramoein372 commented 2 years ago

Thanks Li.

I cannot find any line related to the annotate.cpp code in my current log file.

Would you please provide me an example here? Just the command. What input should I provide?

mourisl commented 2 years ago

You can directly run "./annotator" to get the readme. In your example, it could be annotator -f IMGT+C.fa -a assembled_reads.fa --fasta --needReverseComplement -t 8.

saramoein372 commented 2 years ago

Not working for me :((

(/athena/namlab/scratch/sam4032/trust) @.*** TRUST4]$ ./Annotator.cpp -f /athena/namlab/scratch/sam4032/HL8_s1s2/IMGT+C.fa -a /athena/namlab/scratch/sam4032/HL8_s1s2/out_FR2/FR2_assembled_reads.fa --fasta --needReverseComplement -t 8

./Annotator.cpp: line 14: char: command not found

./Annotator.cpp: line 15: Required:\n: command not found

./Annotator.cpp: line 16: \t-f STRING: fasta file containing the receptor genome sequence\n: command not found

./Annotator.cpp: line 17: \t-a STRING: path to the assembly file\n: command not found

./Annotator.cpp: line 18: Optional:\n: command not found

./Annotator.cpp: line 19: \t-r STRING: path to the reads used in the assembly\n: command not found

./Annotator.cpp: line 20: \t--fasta: the assembly file is in fasta format (default: false)\n: command not found

./Annotator.cpp: line 21: \t--fastq: the assembly file is in fastq format (default: false)\n: command not found

./Annotator.cpp: line 22: \t-t INT: number of threads (default: 1)\n: command not found

./Annotator.cpp: line 23: \t-o STRING: the prefix of the file containing CDR3 information (default: trust)\n: command not found

./Annotator.cpp: line 24: //\t--partial: including partial CDR3s in the report (default: false)\n: No such file or directory

./Annotator.cpp: line 25: \t--barcode: there is barcode information in -a and -r files (default: not set)\n: command not found

./Annotator.cpp: line 26: \t--UMI: there is UMI information in -r file (default: not set)\n: command not found

./Annotator.cpp: line 27: \t--geneAlignment: output the gene alignment (default: not set)\n: command not found

./Annotator.cpp: line 28: \t--airrAlignment: output the aligned sequences to prefix_airr_align.tsv (default: not set)\n: command not found

./Annotator.cpp: line 29: \t--noImpute: do not impute CDR3 sequence for TCR (default: not set (impute))\n: command not found

./Annotator.cpp: line 30: \t--notIMGT: the receptor genome sequence is not in IMGT format (default: not set(in IMGT format))\n: command not found

./Annotator.cpp: line 31: \t--outputCDR3File: output CDR3 file when not using -r option (default: no output)\n: command not found

./Annotator.cpp: line 32: \t--needReverseComplement: reverse complement sequences on another strand (default: no)\n: command not found

./Annotator.cpp: line 33: \t--readAssignment STRING: output the read assignment to the file (default: no output)\n: command not found

./Annotator.cpp: line 35: char: command not found

./Annotator.cpp: line 36: -1,: command not found

./Annotator.cpp: line 37: -1,: command not found

./Annotator.cpp: line 38: -1,: command not found

./Annotator.cpp: line 40: char: command not found

./Annotator.cpp: line 42: char: command not found

./Annotator.cpp: line 43: char: command not found

./Annotator.cpp: line 44: char: command not found

./Annotator.cpp: line 45: char: command not found

./Annotator.cpp: line 47: static: command not found

./Annotator.cpp: line 48: static: command not found

./Annotator.cpp: line 62: syntax error near unexpected token `0,'

./Annotator.cpp: line 62: ` { (char *)0, 0, 0, 0} '

mourisl commented 2 years ago

Annotator.cpp is the source code. You need to run its compiled executable file "annotator" in the folder.

saramoein372 commented 2 years ago

Sorry, but the folder only has "Annotator.cpp". Would you please send me the location of the "annotator" file?

Thank you!

mourisl commented 2 years ago

Have you run "make" to compile TRUST4? If you installed TRUST4 from conda, you can directly run "annotator".

saramoein372 commented 2 years ago

Thank you Li.

I searched my compiled folder ./trust/bin and found it. Appreciate your support!

One more question: which types of files need "--needReverseComplement"? I am not sure whether I should add this option. How can I find the answer to this question? Sorry, I am not a biology person.

saramoein372 commented 2 years ago

And one last question:

The file "assembled_reads.fa" is not the contigs. Correct? Can we say "assembled_reads.fa" is the raw reads after some steps of assembly? And it is not the contigs? And not the raw reads?

I am just trying to understand the meaning of these files correctly.

Thank you so much!

mourisl commented 2 years ago

TRUST4's assembled contigs are all oriented 5'->3', so the annotator does not need to consider the reverse complement for them. For raw data, some reads need to be reverse-complemented.

The assembled_reads file is not for the contigs. It contains the reads that are used to build the contigs. There could be some processing of the reads in the file, such as trimming and merging of mate-pair reads.

saramoein372 commented 2 years ago

Thank you so much Li.

My annotator code is running. I just wanted to ask for an estimation of the hours it needs to finish running. How long does it usually take to run?

mourisl commented 2 years ago

It annotates about 200K reads/minute with 8 threads. You can specify more threads (-t) to make it faster.
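
A back-of-the-envelope estimate based on that figure, as a small Python sketch. It assumes throughput scales linearly with the thread count, which is an idealization; real scaling depends on the machine and on I/O. The function name is hypothetical.

```python
def estimated_minutes(num_reads, threads=8, reads_per_minute_at_8_threads=200_000):
    """Estimate annotator run time, assuming linear scaling with threads."""
    rate = reads_per_minute_at_8_threads * threads / 8
    return num_reads / rate

print(estimated_minutes(10_000_000))              # 10 million reads, 8 threads
print(estimated_minutes(10_000_000, threads=16))  # same reads, 16 threads
```

Under these assumptions, 10 million reads would take about 50 minutes with 8 threads and about 25 minutes with 16.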

saramoein372 commented 2 years ago

Hi Li,

I have a question: as you remember, I need to get the VDJ assignment for the reads in assembled_reads.fa. I am running: annotator -f IMGT+C.fa -a assembled_reads.fa --fasta --needReverseComplement -t 8

So where does the output go? Do I need to add an output directory?

Thanks!

mourisl commented 2 years ago

Sorry I missed that. It directly outputs on the screen, but you can use ">" to redirect the output to a file.
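
In a shell, that means appending "> some_file" to the command. If the annotator is launched from a script instead, a minimal Python sketch of the same redirection could look like this. `run_to_file` is a hypothetical helper, and the demo uses a stand-in command so it runs without TRUST4 installed.

```python
import os
import subprocess
import sys
import tempfile

def run_to_file(command, out_path):
    """Run a command and write its standard output to out_path (like `command > out_path`)."""
    with open(out_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)

# The command from this thread would be passed as a list, for example:
# ["annotator", "-f", "IMGT+C.fa", "-a", "assembled_reads.fa",
#  "--fasta", "--needReverseComplement", "-t", "8"]
# Stand-in command used here so the sketch is runnable anywhere:
demo_path = os.path.join(tempfile.gettempdir(), "annotator_demo.out")
run_to_file([sys.executable, "-c", "print('demo output')"], demo_path)
print(open(demo_path).read())
```

Passing the command as a list avoids shell quoting problems with file names such as IMGT+C.fa.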

saramoein372 commented 2 years ago

Thank you Li. In my file.out, I could see the reads. But I am not sure which names are VDJ assignments. For example, in these two lines:

A00814:550:HYJTNDSX2:4:1159:14362:5854 90 0.18 * IGKC(523):(0-89):(46-135):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

GCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGC

A00814:550:HYJTNDSX2:4:1159:15646:11647 90 0.18 * IGKC(523):(0-89):(46-135):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null

GCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGC

Which of the names are showing the VDJ assignments?

Thank you!

mourisl commented 2 years ago

These reads are from the constant gene, as you can see from the IGKC annotation. The output format of the annotator is described in README.md.
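
A small Python sketch for pulling the gene names out of such a header line. It assumes each gene annotation token has the layout gene(reference length):(sequence start-end):(reference start-end):similarity, which matches the IGKC token quoted above; check README.md for the authoritative description. The function names are hypothetical, and tokens listing several candidate genes are not handled.

```python
import re

# Assumed token layout: gene(ref_len):(seq_start-seq_end):(ref_start-ref_end):similarity
ANNOTATION = re.compile(
    r"^(?P<gene>[^()]+)\((?P<ref_len>\d+)\):"
    r"\((?P<seq_start>\d+)-(?P<seq_end>\d+)\):"
    r"\((?P<ref_start>\d+)-(?P<ref_end>\d+)\):"
    r"(?P<similarity>\d+(?:\.\d+)?)$"
)

def parse_gene_annotation(token):
    """Parse one gene annotation token into a dict, or return None if it does not match."""
    m = ANNOTATION.match(token)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "gene": d["gene"],
        "ref_len": int(d["ref_len"]),
        "seq_range": (int(d["seq_start"]), int(d["seq_end"])),
        "ref_range": (int(d["ref_start"]), int(d["ref_end"])),
        "similarity": float(d["similarity"]),
    }

def gene_calls(header):
    """Return the gene names found among the whitespace-separated header tokens."""
    parsed = (parse_gene_annotation(tok) for tok in header.split())
    return [p["gene"] for p in parsed if p is not None]

header = (
    "A00814:550:HYJTNDSX2:4:1159:14362:5854 90 0.18 * "
    "IGKC(523):(0-89):(46-135):100.00 "
    "CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(0-0):0.00=null"
)
print(gene_calls(header))
```

For the header quoted in this thread, the only gene token found is the IGKC constant gene; the `*` placeholders and the CDR fields are skipped.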