liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Filter cells from TRUST4 result #87

Open Chenjunjie1996 opened 3 years ago

Chenjunjie1996 commented 3 years ago

Hi, I recently analyzed the same sample with Cell Ranger and TRUST4. Even after running the barcoderep-filter script, the number of cells in the TRUST4 barcode report (6k B cells) is still much higher than expected (3k B cells). I have already read https://github.com/liulab-dfci/TRUST4/issues/82#issue-996603391. Do you have any suggestions for filtering the cells further to get closer to the expected cell number?

mourisl commented 3 years ago

Is this a BCR-amplified library? If so, you can also filter on the number of reads for each cell in the barcode_report.tsv file.

Chenjunjie1996 commented 3 years ago

Thank you for your reply. Yes, it's an amplified library. However, I have generated annot.fa (the full-length result), and the cell number in the full-length result is too high. Can I take the cell barcodes that pass the barcoderep-filter script and use them to select the matching barcodes in the full-length result, to reduce the cell number? By the way, what do secondary_chain1 and secondary_chain2 mean in the filtered barcode report?

Chenjunjie1996 commented 3 years ago

To clarify: I extracted the list of cell barcodes from the full-length result, then extracted the cell barcodes from the filtered barcode report and intersected the two lists to get a new cell barcode list. But the cell number is still too high (about 4k). How can I filter the cells further, starting from this new list?

mourisl commented 3 years ago

The secondary chains are the other CDR3s assembled in that cell. Because their abundance is relatively low, I put them in the secondary category.

Full-length assemblies usually need more reads to support them, so they are unlikely to be false positives. Do you see those cells in the Cell Ranger raw contig files? For amplified data, Cell Ranger may use other tricks to filter the cells further, but I need some time to look into those.

Chenjunjie1996 commented 3 years ago

Yes, I used the GetFullLengthAssembly.pl script to get the full-length result, and I found that the TRUST4 full-length result may not filter out other cell types. So I kept only the cell barcodes from the full-length result that are identified as B cells in the filtered barcode report. Then I filtered the cells again based on the number of reads for the matched barcodes in the trust-report.out file. Now the cell number meets my expectation. Is that sensible?

mourisl commented 3 years ago

In recent versions of TRUST4, I put the full-length indicator in the barcode report file as well. Each chain has the fields "V_gene,D_gene,J_gene,C_gene,cdr3_nt,cdr3_aa,read_cnt,consensus_id,CDR3_germline_similarity,consensus_full_length". So you don't need to run GetFullLengthAssembly.pl, which should save effort in the future.

What read-count cutoff are you using? Just to make sure: are you filtering on the number of reads in the trust_report.tsv file or the trust_barcode_report.tsv file? trust_report.tsv gives the count for each CDR3, not for each cell barcode.

Overall, you can run filtered-barcode-result and then filter out the cells with too few supporting reads; the result should be good. I don't think you need to filter on full-length assemblies, given that you will filter on the number of reads.
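
The read-count filter described above could be sketched in Python roughly as follows. This is only an illustration, not a script shipped with TRUST4. It assumes the barcode report is tab-separated with the columns barcode, cell_type, chain1, chain2 (followed by the secondary chains), that header lines start with "#", that an absent chain is written as a placeholder without commas, and that B cells are labeled "B". Using the larger of the chain1 and chain2 read counts as the per-cell support is also an assumption, and the sample rows are invented.

```python
import csv
import io

# Field order inside each chain column, as quoted in this thread.
CHAIN_FIELDS = [
    "V_gene", "D_gene", "J_gene", "C_gene", "cdr3_nt", "cdr3_aa",
    "read_cnt", "consensus_id", "CDR3_germline_similarity",
    "consensus_full_length",
]
READ_CNT_INDEX = CHAIN_FIELDS.index("read_cnt")


def chain_read_count(chain_field):
    """Return the read count of one chain column, or 0.0 if the chain is absent."""
    parts = chain_field.split(",")
    if len(parts) <= READ_CNT_INDEX:  # assumed placeholder for a missing chain
        return 0.0
    return float(parts[READ_CNT_INDEX])


def filter_cells(handle, cell_type="B", min_reads=4):
    """Yield barcodes of the given cell type with enough read support."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        barcode, ctype, chain1, chain2 = row[:4]
        if ctype != cell_type:
            continue
        # Assumption: a cell is kept if its best-supported primary chain passes.
        support = max(chain_read_count(chain1), chain_read_count(chain2))
        if support >= min_reads:
            yield barcode


# Invented rows, for illustration only.
SAMPLE = "\n".join([
    "#barcode\tcell_type\tchain1\tchain2\tsecondary_chain1\tsecondary_chain2",
    "AAAC\tB\tIGHV1,IGHD1,IGHJ4,IGHM,TGTGCG,CA,12.00,c1,98.5,1"
    "\tIGKV1,*,IGKJ1,IGKC,TGTCAG,CQ,30.00,c2,99.0,1\t*\t*",
    "CCCG\tB\t*\tIGKV3,*,IGKJ2,IGKC,TGTCAA,CQ,2.00,c3,97.0,0\t*\t*",
    "GGGT\tabT\tTRBV1,*,TRBJ1,TRBC1,TGTGCC,CA,50.00,c4,99.0,1\t*\t*\t*",
]) + "\n"

kept = list(filter_cells(io.StringIO(SAMPLE), cell_type="B", min_reads=4))
print(kept)
```

To run it on a real report, pass an open file handle instead of the `io.StringIO` wrapper and adjust `min_reads` to the library.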

Chenjunjie1996 commented 3 years ago

Thank you for your explanation.

My barcode report file only shows the columns "barcode, cell_type, chain1, chain2, secondary_chain1, secondary_chain2", so I had to run the GetFullLengthAssembly.pl script to get the full-length information. Do you mean that I don't need to run GetFullLengthAssembly.pl after I get the CDR3 result, and that the trust-report and barcode-report based on the cdr3.out result are enough?

Yes, I filtered again based on the read count in trust_report.tsv, with a cutoff of 4, after matching the barcodes with cell_type B in the barcode report against the cell barcodes extracted from the full-length result file. I used trust_report.tsv because it is the only file where I could find read counts associated with barcodes.

mourisl commented 3 years ago

The comma-separated fields mentioned above apply to chain1 and chain2 respectively. After the amino acid sequence there are a few more numbers, including the read count you used. The last number is 0 or 1, indicating whether the corresponding assembly is full-length.

I think a cutoff of 4 for a BCR-amplified library looks good (or even a bit conservative).
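
For reference, the layout of a chain column described above can be unpacked with a few lines of Python. This is a minimal sketch rather than TRUST4 code: `parse_chain` is a hypothetical helper, the example value is invented, and treating 1 as "full-length" in the last field is an assumption based on the description in this thread.

```python
# Field order inside chain1/chain2, as quoted earlier in this thread.
CHAIN_FIELDS = (
    "V_gene", "D_gene", "J_gene", "C_gene", "cdr3_nt", "cdr3_aa",
    "read_cnt", "consensus_id", "CDR3_germline_similarity",
    "consensus_full_length",
)


def parse_chain(field):
    """Split one chain column into a dict; return None if the chain is absent."""
    parts = field.split(",")
    if len(parts) != len(CHAIN_FIELDS):
        return None  # assumed placeholder for a missing chain, or an older format
    record = dict(zip(CHAIN_FIELDS, parts))
    record["read_cnt"] = float(record["read_cnt"])
    # Assumption: "1" marks a full-length assembly and "0" a partial one.
    record["consensus_full_length"] = record["consensus_full_length"] == "1"
    return record


# Invented example value, for illustration only.
chain = parse_chain(
    "IGHV3-23*01,IGHD3-10*01,IGHJ4*02,IGHM,TGTGCGAAA,CAK,12.00,assemble5,98.50,1"
)
print(chain["read_cnt"], chain["consensus_full_length"])
```

With the fields named like this, the read-count cutoff and the full-length flag can both be checked from the barcode report alone.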