phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License

What does spurious chain sharing mean? #56

Open chuckzzzz opened 1 year ago

chuckzzzz commented 1 year ago

Hi! Thanks for developing this amazing tool. I have a conceptual question regarding the filtering step.

In the paper, you mentioned:

Here, by default, the 10x cellranger clonotype definitions are filtered to remove spurious chain sharing and merge split clonotypes (for example, due to partial recovery of a second TCRα transcript).

This is also reflected by the default stringent=True for the make_10x_clones_file function.

However, this filtering step results in the majority of my paired data being dropped.

repeat?? 1598 36 ('TRAV12-1*01', 'TRAJ41*01', 'CVVNGNSGYALNF', 'tgtgtggtgaacggcaattccgggtatgcactcaacttc') ('TRAV12-2*01', 'TRAJ41*01', 'CVVNGNSGYALNF', 'tgtgtggtgaacggcaattccgggtatgcactcaacttc')
...
old_unpaired_barcodes: 44 old_paired_barcodes: 131357 new_stringent_paired_barcodes: 238

Setting stringent=False would obviously circumvent this problem, but I wonder what this actually means and whether my data has something seriously wrong haha? From the output, I am assuming there are duplicates, but wouldn't clonal expansion also count as duplicates and thus be removed? Anyway, it would be cool to know what exactly the stringent criteria filter for.

Thanks in advance!

phbradley commented 1 year ago

Hi there, thanks for trying conga! Great question. With 10x VDJ data, it's not uncommon to see pairings between one of the chains (alpha or beta) in a high frequency (ie, highly expanded) clonotype and multiple other diverse chains at much lower counts. In other words, you see that two clonotypes, with very different clone sizes, share one of their chains (share in the sense of exact identity at the nucleotide level). Our interpretation of this is that there is some "leakage" (maybe ambient RNA) of the high-frequency transcripts, and these TCR transcripts get encapsulated in droplets that they don't really belong to. So in the filtering step, the alpha-beta pairings are sorted by the number of cells each pairing was observed in, and then we go through that list in decreasing order of cell count, looking for chains that we've already seen paired with a different partner and much higher count. There's some logic to allow for clonotypes with two in-frame alpha chains. In all of this, clonal expansion has been accounted for, and we are just operating at the level of unique alpha-beta sequence pairs (each might occur in many individual droplets/barcodes).
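The filtering pass described above can be sketched roughly as follows. This is a minimal illustration written for this thread, not conga's actual code: the function name `filter_spurious_pairings`, the `min_count_fraction` parameter, and its default of 0.25 are all hypothetical, and the extra logic for clonotypes with two in-frame alpha chains is left out.

```python
def filter_spurious_pairings(pair_counts, min_count_fraction=0.25):
    """Keep alpha-beta pairings that do not look like chain leakage.

    pair_counts maps a unique (alpha, beta) pair to the number of cells it
    was observed in. The threshold is a hypothetical stand-in for "much
    higher count"; it is not taken from conga.
    """
    best_count = {}  # chain -> cell count of the largest kept pairing using it
    kept = []
    # Walk the unique pairings from most to least frequent.
    ordered = sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (alpha, beta), count in ordered:
        # A chain already seen here was necessarily seen with a different
        # partner, because each (alpha, beta) pair occurs only once.
        spurious = any(
            chain in best_count and count < min_count_fraction * best_count[chain]
            for chain in (alpha, beta)
        )
        if spurious:
            continue
        kept.append(((alpha, beta), count))
        for chain in (alpha, beta):
            best_count.setdefault(chain, count)
    return kept


# Toy input: chain "b1" belongs to an expanded clone and also shows up
# paired with "a2" in only two cells.
pair_counts = {
    ("a1", "b1"): 100,
    ("a2", "b1"): 2,
    ("a3", "b3"): 5,
    ("a1", "b4"): 60,
}
kept = filter_spurious_pairings(pair_counts)
```

In this toy input the ("a2", "b1") pairing is dropped because "b1" was already seen in a pairing with 100 cells, while ("a1", "b4") survives because its count is not much lower than that of ("a1", "b1"). Note that the counts are per unique pairing, so an expanded clone is a single entry with a large count rather than a set of duplicates.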

Can you tell me a bit more about your data? That is a very large number of pairings! Is this from the new ChromiumX, or have multiple filtered_contig files been combined? Are these invariant T cells where we might expect an unusually high level of exact chain sharing? I've found that it's best to apply this filtering analysis at the level of individual 10x runs, since that's the situation where we run into this chain sharing.

Happy to elaborate if any of this is unclear, and to help debug if possible. If you were comfortable sharing your contigs file, I could process it on my end and see if there's anything funny going on. Or you could share the full log file. pbradley@fredhutch.org

chuckzzzz commented 1 year ago

Wow. Thank you for this very quick and detailed response!

Yes, let me explain a bit about my data. Right now I have 40 samples sequenced with 10X immune profiling, and I ran cellranger VDJ on each of the samples. The RNA part goes through the standard cellranger, integration, and QC steps to become a single Seurat object, which is then written to h5 per the instructions on GitHub. This means I have 40 contig files and 1 h5 format counts file for all 40 samples.

Based on #28, I tried merging them together with the make_10x_clones_file_batch function, but it failed. I created the meta file in csv format like this:

file batch_id
{path_to_A1_filtered_contig_annotations.csv} A1
{path_to_A2_filtered_contig_annotations.csv} A2
... ...
{path_to_H5_filtered_contig_annotations.csv} H5

My barcode in the Seurat object has sample ID as the prefix followed by an underscore and then a "-1" (e.g. A1_ACCTGGATAT-1). So I called this function like this:

make_10x_clones_file_batch({path_to_meta_file}, "human", clones_file, add_batch_id_location="prefix", batch_id_delim="_")

However, it spits out an error message after denoting the presence of repeats.

repeat?? 3 93 ('TRBV11-1*01', 'TRBJ2-5*01', 'CASSLLSDSLEETQYF', 'tgtgccagcagcttattgagtgactccttagaagagacccagtacttc') ('TRBV11-3*01', 'TRBJ2-5*01', 'CASSLLSDSLEETQYF', 'tgtgccagcagcttattgagtgactccttagaagagacccagtacttc')
repeat?? 1836 41 ('TRAV12-1*01', 'TRAJ41*01', 'CVVNGNSGYALNF', 'tgtgtggtgaacggcaattccgggtatgcactcaacttc') ('TRAV12-2*01', 'TRAJ41*01', 'CVVNGNSGYALNF', 'tgtgtggtgaacggcaattccgggtatgcactcaacttc')
repeat?? 1 2 ('TRBV12-4*01', 'TRBJ2-2*01', 'CASSEREGLTGELFF', 'tgtgccagcagtgagcgggagggcctaaccggggagctgtttttt') ('TRBV6-1*01', 'TRBJ2-2*01', 'CASSEREGLTGELFF', 'tgtgccagcagtgagcgggagggcctaaccggggagctgtttttt')


TypeError                                 Traceback (most recent call last)
<ipython-input-8-61e27aef550d> in <module>()
----> 1 make_10x_clones_file_batch("test/sample_meta.csv", organism, clones_file, add_batch_id_location="prefix", batch_id_delim="_")

conga/tcrdist/make_10x_clones_file.py in make_10x_clones_file_batch(metadata_file, organism, clones_file, replace_batch_id, strip_batch_id_location, add_batch_id_location, batch_id_delim, stringent, **kwargs)
    761
    762     if stringent:
--> 763         clonotype2tcrs, clonotype2barcodes = setup_filtered_clonotype_dicts( clonotype2tcrs, clonotype2barcodes )
    764
    765

conga/tcrdist/make_10x_clones_file.py in setup_filtered_clonotype_dicts(clonotype2tcrs, clonotype2barcodes, min_repeat_count_fraction, verbose)
    536     pairs_tuple2clonotypes = {}
    537     ab_counts = Counter() # for diagnostics
--> 538     for (clone_size, cid) in reversed( sorted( (len(y), x) for x,y in clonotype2barcodes.items() ) ):
    539         if cid not in clonotype2tcrs:
    540             #print('WHOAH missing tcrs for clonotype', clone_size, cid, clonotype2barcodes[cid])

TypeError: '<' not supported between instances of 'str' and 'float'



So I tried concatenating these contig files altogether by changing the barcode to be the same format. This naming is consistent with the integrated Seurat object. I think that's why the pairing is successful. 

> old_unpaired_barcodes: 44 **old_paired_barcodes: 131357** new_stringent_paired_barcodes: 238

However, since the majority is removed I am guessing simply concatenating the filtered contig files together might not be the right thing to do. I would love to share the data, but let me double-check with my PI first. I am thinking maybe a few of the contig files should be enough for testing. 

In the meantime, any thoughts on which step I did wrong based on my description? Thanks so much!

chuckzzzz commented 1 year ago

Hi I can share a few annotation files with you, maybe over email? Is pbradley@fredhutch.org your email address?

phbradley commented 1 year ago

Yes, that email address is correct. Take care, Phil


chuckzzzz commented 1 year ago

Just sent it thanks!

TheRaspberryFox commented 6 months ago

Hi,

I have been using this package and have had great success with it.

However, I am currently running into an error that I have never run into before. When I run the make_10x_clones_file_batch command, I get the error mentioned in this thread:

TypeError: '<' not supported between instances of 'str' and 'float'

I am reaching out in the hope that this was solved and that there may be a solution.

Much Appreciated!

phbradley commented 6 months ago

Hi there, Thanks for the kind words and for letting us know about this issue. I have not seen a new error like this yet, so it might help to get a bit more context. Could you share a bit more of the error message, like the line where it happened? It sounds like it could be a missing field in the CSV file, since then you have an 'na' value in the dataframe which I think pandas treats as a float. Take care, Phil
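To make the suspected mechanism concrete, here is a small reproduction sketch. It is not conga's code: it only borrows the shape of the `sorted( (len(y), x) ... )` expression from the traceback earlier in the thread, and it uses `float("nan")` as a stand-in for the value an empty CSV field would become in a dataframe. The dictionary contents are made up.

```python
# Toy stand-in for clonotype2barcodes: one key is NaN (a float) where the
# clonotype id was missing, the others are ordinary strings.
missing_id = float("nan")
clonotype2barcodes = {
    "clonotype1": ["AAAC-1"],
    missing_id: ["AAAG-1"],
    "clonotype2": ["AAAT-1"],
}

error_message = None
try:
    # All three clonotypes have size 1, so the tuples tie on the first
    # element and Python falls back to comparing the ids, i.e. str vs float.
    sorted((len(bcs), cid) for cid, bcs in clonotype2barcodes.items())
except TypeError as err:
    error_message = str(err)

print(error_message)
```

The comparison only reaches the ids when two clonotypes have the same size, which is why the operand order in the message (str vs float, or float vs str) can differ between runs on different data.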

TheRaspberryFox commented 6 months ago

Wow, thank you for the quick reply. Happy to provide more information.

Here is the full error message:


TypeError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 make_10x_clones_file_batch(metadata_file = "CoNGA_metadata.csv", organism = "mouse", clones_file = "clones.tsv", strip_batch_id_location = 'prefix', add_batch_id_location = 'prefix', stringent= True)

File /media/cui-lab/Data_temp/Ryan_Brown/Pipelines/CoNGA_Pipeline/conga/conga/tcrdist/make_10x_clones_file.py:763, in make_10x_clones_file_batch(metadata_file, organism, clones_file, replace_batch_id, strip_batch_id_location, add_batch_id_location, batch_id_delim, stringent, **kwargs)
    754 clonotype2tcrs, clonotype2barcodes = read_tcr_data_batch( organism,
    755     metadata_file,
    756     replace_batch_id,
   (...)
    759     batch_id_delim,
    760     **kwargs )
    762 if stringent:
--> 763     clonotype2tcrs, clonotype2barcodes = setup_filtered_clonotype_dicts( clonotype2tcrs, clonotype2barcodes )
    766 _make_clones_file( organism, clones_file, clonotype2tcrs, clonotype2barcodes )

File /media/cui-lab/Data_temp/Ryan_Brown/Pipelines/CoNGA_Pipeline/conga/conga/tcrdist/make_10x_clones_file.py:538, in setup_filtered_clonotype_dicts(clonotype2tcrs, clonotype2barcodes, min_repeat_count_fraction, verbose)
    536 pairs_tuple2clonotypes = {}
    537 ab_counts = Counter() # for diagnostics
--> 538 for (clone_size, cid) in reversed( sorted( (len(y), x) for x,y in clonotype2barcodes.items() ) ):
    539     if cid not in clonotype2tcrs:
    540         #print('WHOAH missing tcrs for clonotype', clone_size, cid, clonotype2barcodes[cid])
    541         continue

TypeError: '<' not supported between instances of 'float' and 'str'

phbradley commented 6 months ago

Great, thanks for that. It looks to me like this could be due to empty values in the raw_clonotype_id column of the contigs csv file. I just checked in a change:

https://github.com/phbradley/conga/commit/84c0ea72dd3a738e5cb896fca4f2fec27a8cc092

which you can just apply to the corresponding line of your code (if you don't want to update the repo), or try pulling the new version of the code from github. Let me know if that fixes it.
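For anyone who would rather clean the input than patch the code, a workaround along these lines may help. This is only a sketch and does not reproduce the linked commit: the helper name `drop_rows_missing_clonotype` is hypothetical, the demo rows are made up, and it assumes that dropping contig rows with an empty raw_clonotype_id is acceptable for your analysis.

```python
import csv
import io


def drop_rows_missing_clonotype(src, dst, column="raw_clonotype_id"):
    """Copy CSV rows from src to dst, skipping rows where `column` is empty.

    src and dst are open text file objects. Returns the number of rows dropped.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    dropped = 0
    for row in reader:
        value = (row.get(column) or "").strip()
        if not value:
            dropped += 1
            continue
        writer.writerow(row)
    return dropped


# Tiny in-memory demo with a made-up two-column contigs table.
demo = "barcode,raw_clonotype_id\nAAAC-1,clonotype1\nAAAG-1,\nAAAT-1,clonotype2\n"
cleaned = io.StringIO()
n_dropped = drop_rows_missing_clonotype(io.StringIO(demo), cleaned)
print(n_dropped)
print(cleaned.getvalue())
```

With real files, the same function could be called on `open(path, newline="")` handles, writing the cleaned copy to a new path so the original contigs file stays untouched.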

TheRaspberryFox commented 6 months ago

Thanks so much!

Removing the NA values fixed the error!