empty cell_id column in airr file from 10x Genomics data

liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

MIT License


empty cell_id column in airr file from 10x Genomics data #253

Open mapo9 opened 3 months ago

mapo9 commented 3 months ago

Hey guys!

I am running TRUST4 on 10x Genomics RNA-seq data. In particular, we use the AIRR file for our post-processing analysis. For a QC protocol we need the cell_ids, but for me this column is always empty. Is this expected behaviour, or is there a problem with my command?

run-trust4 \
    -1 ../test_files/SRR20751887_S1_L001_R1_001.fastq -2 ../test_files/SRR20751887_S1_L001_R2_001.fastq \
    --barcode ../test_files/SRR20751887_S1_L001_R1_001.fastq \
    --readFormat bc:0:15,um:16:27 \
    --UMI ../test_files/SRR20751887_S1_L001_R2_001.fastq \
    -t 6 \
    -f IMGT+C.fa \
    -o out

Do you store the cell_ids somewhere so that I could add the column to out_airr.tsv?

Also, I see that you also create out_barcode_airr.tsv; I don't really get what to do with this file.

Thanks a lot!

mourisl commented 3 months ago

The cell_id should be in the out_barcode_airr.tsv file, which gives the BCR/TCR information for each barcode. The file without "barcode" in its name is the summary of the clonotypes; for example, its "consensus_count" is the number of cells bearing this clonotype (V, J, C, CDR3nt). I may remove the "cell_id" column from the bulk AIRR format (the one without barcode) to avoid confusion.

There might be some other issues with your command too. Are the UMI and barcode from the same file? Based on the --readFormat, I guess the format is a 16bp barcode followed by a 12bp UMI in the R1 file, but you gave the R2 file as the UMI file. Another issue is that if the R1 file also contains VDJ sequence information, you may need to add something like "--readFormat bc:0:15,um:16:27,r1:28:-1" so that the barcode and UMI portions are excluded from the assembly.
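To make the --readFormat coordinates concrete, here is a minimal Python sketch of how a spec like the one above would split a read. It is not TRUST4's code. It only assumes what the thread implies: positions are 0-based and inclusive (so 0:15 covers 16bp) and an end of -1 means "to the end of the read". The helper names and the example read are made up for illustration.

```python
def parse_read_format(spec):
    """Parse a spec like 'bc:0:15,um:16:27,r1:28:-1' into {field: (start, end)}."""
    fields = {}
    for part in spec.split(","):
        name, start, end = part.split(":")
        fields[name] = (int(start), int(end))
    return fields


def slice_field(seq, start, end):
    """Return seq[start..end], 0-based and inclusive; end == -1 means 'to the end'."""
    return seq[start:] if end == -1 else seq[start:end + 1]


fmt = parse_read_format("bc:0:15,um:16:27,r1:28:-1")

# Made-up read: 16bp barcode, 12bp UMI, then the part that would go into assembly.
read = "A" * 16 + "C" * 12 + "GATTACA"
parts = {name: slice_field(read, s, e) for name, (s, e) in fmt.items()}

for name, value in parts.items():
    print(name, len(value), value)
```

Without the r1 field, the barcode and UMI bases would stay in the R1 sequence that is passed to assembly, which is the situation the suggestion above is meant to avoid.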

Hope this helps.

mapo9 commented 3 months ago

Thanks for your reply! Then I guess there is a way to "stitch" together the information from the 2 files to infer the cell_ids for each clonotype?

Thanks for the heads up about R2; that was just a typo here. It's as you said: the first 16bp of R1 is the barcode, the next 12bp is the UMI.

Now I also understand what the r1 is there for in the readFormat 💡

mourisl commented 3 months ago

I don't think you need to stitch them together. The barcode_airr file already contains the clonotype information for each cell barcode. Or do you mean you want the list of cell ids with the same clonotype?

mapo9 commented 3 months ago

I want to have them in the TRUST_airr.tsv file. Or am I misunderstanding something?

I thought that the TRUST_airr.tsv file holds the clonotypes extracted from the RNA-seq data and that barcode_airr holds the barcodes found in the data? Or does barcode_airr hold the actual clonotypes, before they are condensed with consensus_count?

mourisl commented 3 months ago

Right. The barcode_airr file holds the actual clonotypes for each barcode. The "consensus_count" in that file is the number of reads/UMIs supporting this contig.
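For anyone who still wants the list of cell ids per clonotype, a rough sketch of grouping the per-barcode rows by the clonotype key mentioned earlier (V, J, C, CDR3nt) could look like the following. The column names v_call, j_call, c_call and junction are assumptions based on the AIRR rearrangement schema, not something confirmed in this thread, so check them against the header of your own out_barcode_airr.tsv. The function name and demo rows are made up.

```python
import csv
import io
from collections import defaultdict


def cells_per_clonotype(handle, key_cols=("v_call", "j_call", "c_call", "junction")):
    """Group cell_id values by clonotype key from a tab-separated AIRR-style table.

    key_cols are assumed column names; adjust them to match the real header.
    """
    groups = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        cell = row.get("cell_id", "")
        if not cell:
            continue  # skip rows that carry no barcode
        key = tuple(row.get(col, "") for col in key_cols)
        groups[key].add(cell)
    return groups


# Made-up rows standing in for a per-barcode AIRR file.
demo = io.StringIO(
    "cell_id\tv_call\tj_call\tc_call\tjunction\tconsensus_count\n"
    "AAAC-1\tTRBV1\tTRBJ1\tTRBC1\tTGTGCC\t12\n"
    "AAAG-1\tTRBV1\tTRBJ1\tTRBC1\tTGTGCC\t7\n"
    "AAAT-1\tTRBV2\tTRBJ2\tTRBC2\tTGTAGC\t3\n"
)
groups = cells_per_clonotype(demo)

for key, cells in groups.items():
    print("\t".join(key), len(cells), ",".join(sorted(cells)), sep="\t")
```

With a real file, the same function can be called as `with open("out_barcode_airr.tsv", newline="") as fh: groups = cells_per_clonotype(fh)`. Whether the resulting keys match rows in the bulk AIRR file exactly (for example in how allele names are written) is not covered in this thread and would need checking.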