sciencisto / TCRDivER


Procedure #1

Closed fbenedett closed 1 year ago

fbenedett commented 3 years ago

Hello, I have a few problems with the procedure described. You write: python3 filter_df_and_make_jobs.py - "/work3/milvu/Data/2018_Formenti_Demaria_Lung/"

I see that the arguments are position-dependent. Argparse could help with that and would also make the arguments easier to identify. Anyway, I am confused by "path_to_scripts_folder: this is the templates folder". If we are located in the same folder as Readme.md, where is that directory? Additionally, you mentioned the Adaptive Biotechnologies format. Could you provide a couple of lines in that format? I need to reformat my data accordingly.

sciencisto commented 3 years ago

Hi, sorry for the delayed response. You are right, argparse would be a better option, and we plan to implement it.
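For illustration, here is a minimal sketch of what named arguments via argparse could look like. The option names (--input-folder, --scripts-folder) and the description text are hypothetical; they are not the actual arguments of filter_df_and_make_jobs.py.

```python
import argparse


def build_parser():
    # Hypothetical option names, for illustration only; the real script
    # currently takes positional arguments.
    parser = argparse.ArgumentParser(
        description="Filter repertoire files and create jobs (illustrative)."
    )
    parser.add_argument(
        "--input-folder",
        required=True,
        help="folder that contains the input files",
    )
    parser.add_argument(
        "--scripts-folder",
        default=".",
        help="folder that contains the Python scripts and templates/",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--input-folder", "/path/to/data"])
print(args.input_folder, args.scripts_folder)
```

With named options the order no longer matters, and running the script with -h lists every argument together with its help text.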

The path to the scripts folder is simply wherever you keep your Python scripts. In this repo, that is the folder that contains the README. It is passed when making new jobs so that they know where to find the Python scripts, and the ./templates/ folder should be in that same folder. All of the bash scripts are written to run on the clusters we have at the Technical University of Denmark. You may need to adjust them slightly so that they also run on your setup.

The Adaptive Biotechnologies format comes in two versions (see below). The code handles both. You can make a file with your own sequences and their counts/frequencies; you just need to name these columns consistently:

Version 1: amino_acid (TCR sequence), productive_frequency (frequency of in-frame sequences), frame_type (In, Out, Stop)

Version 2: aminoAcid (TCR sequence), count (templates/reads) (read counts), sequenceStatus

There is also a variation of Version 2 that has count (reads) instead of count (templates/reads). I believe you can also access the Adaptive immunoSEQ database by signing up on their website, where you can download complete example files: https://clients.adaptivebiotech.com/immuneaccess
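As a rough sketch of the "make a file with your sequences" suggestion, the snippet below writes a minimal table using the Version 2 column names listed above. It assumes a tab-separated file and that the three named columns are sufficient; neither assumption is confirmed in this thread. The helper name write_minimal_input is made up for the example.

```python
import csv
import io

# Version 2 column names as listed in this thread.
V2_COLUMNS = ["aminoAcid", "count (templates/reads)", "sequenceStatus"]


def write_minimal_input(handle, rows):
    """Write (sequence, count, status) tuples as a tab-separated table.

    Hypothetical helper: tab separation and the sufficiency of these
    three columns are assumptions, not confirmed TCRDivER requirements.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(V2_COLUMNS)
    for sequence, count, status in rows:
        writer.writerow([sequence, count, status])


# Demo with an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_minimal_input(buffer, [("CASSSGQGVSNEQFF", 1, "In")])
print(buffer.getvalue())
```

To write a real file, open it with open(path, "w", newline="") and pass the handle in place of the buffer.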

Hope this helps! I am interested to hear about the outcome so please share :)

Best, Milena

Here are the first three lines of version 1:

rearrangement amino_acid frame_type rearrangement_type templates reads frequency productive_frequency cdr3_length v_family v_gene v_allele d_family d_gene d_allele j_family j_gene j_allele v_deletions d5_deletions d3_deletions j_deletions n2_insertions n1_insertions v_index n1_index n2_index d_index j_index v_family_ties v_gene_ties v_allele_ties d_family_ties d_gene_ties d_allele_ties j_family_ties j_gene_ties j_allele_ties sequence_tags v_shm_count v_shm_indexes antibody bio_identity v_resolved d_resolved j_resolved sample_name species locus product_subtype kit_pool sku test_name sample_catalog_tags total_templates productive_templates outofframe_templates stop_templates dj_templates total_rearrangements productive_rearrangements outofframe_rearrangements stop_rearrangements dj_rearrangements total_reads total_productive_reads total_outofframe_reads total_stop_reads total_dj_reads productive_clonality productive_entropy sample_clonality sample_entropy sample_amount_ng sample_cells_mass_estimate fraction_productive_of_cells_mass_estimate sample_cells fraction_productive_of_cells max_productive_frequency max_frequency counting_method primer_set sequence_result_status release_date upload_date sample_tags fraction_productive order_name kit_id total_t_cells total_templates_agg
TTTACATATATCTGCCGTGGATCCAGAAGACTCAGCTGTCTATTTTTGTGCCAGCAGCCAAGATCTGGGGTCTCCTATGAACAGTAC Out VDJ 1 2 1.0109281331190165E-5 44 TCRBV05 TCRBV05-01 01 TCRBD02 TCRBD02-01 01 TCRBJ02 TCRBJ02-07 01 0 4 4 0 1 1 46 63 70 64 71 X+TCRBV05-01+TCRBJ02-07 TCRBV05-0101 TCRBD02-0101 TCRBJ02-0701 UT118 Mouse TCRB Survey 938 871 60 6 0 388 253 113 22 0 197838 183772 12696 1370 0 0.753148794 1.97061169 0.695283234 2.62053752 326.533997 50235 0.01733850927976155 0 0.450313419 0.418296784 v2 Mouse-TCRB-PD1x Published 07/14/2017 03:21:42 07/11/2017 12:00:00 0.9285714032618942
CTTTTACATATATCTGCCGTGGATCCAGAAGACTCAGCGTTCTATTTTTGTGCCAGCAGCCCCAGGGACAGAAACACAGAAGTCTTC CASSPRDRNTEVFF In VDJ 1 9 4.549176599035575E-5 4.8973728315521404E-5 42 TCRBV05 TCRBV05-01 01 TCRBD01 TCRBD01-01 01 TCRBJ01 TCRBJ01-01 01 4 0 5 1 3 0 48 61 -1 64 71 CASSPRDRNTEVFF+TCRBV05-01+TCRBJ01-01 TCRBV05-0101 TCRBD01-0101 TCRBJ01-0101 UT118 Mouse TCRB Survey 938 871 60 6 0 388 253 113 22 0 197838 183772 12696 1370 0 0.753148794 1.97061169 0.695283234 2.62053752 326.533997 50235 0.01733850927976155 0 0.450313419 0.418296784 v2 Mouse-TCRB-PD1x Published 07/14/2017 03:21:42 07/11/2017 12:00:00 0.9285714032618942

Here are the first three lines of version 2:

nucleotide aminoAcid count (templates/reads) frequencyCount (%) cdr3Length vMaxResolved vFamilyName vGeneName vGeneAllele vFamilyTies vGeneNameTies vGeneAlleleTies dMaxResolved dFamilyName dGeneName dGeneAllele dFamilyTies dGeneNameTies dGeneAlleleTies jMaxResolved jFamilyName jGeneName jGeneAllele jFamilyTies jGeneNameTies jGeneAlleleTies vDeletion n1Insertion d5Deletion d3Deletion n2Insertion jDeletion vIndex n1Index dIndex n2Index jIndex estimatedNumberGenomes sequenceStatus cloneResolved vOrphon dOrphon jOrphon vFunction dFunction jFunction fractionNucleated vAlignLength vAlignSubstitutionCount vAlignSubstitutionIndexes vAlignSubstitutionGeneThreePrimeIndexes vSeqWithMutations
TCAGCGCACAGAGCAGCGGGACTCAGCCATGTATCGCTGTGCCAGCAGCTTCAGCGGGGCGACACCGGGGAGCTGTTTTTTGGAGAA 1 3.996823685538164E-6 44 TCRBV07-0701 TCRBV07 TCRBV07-07 01 TCRBD02-0101 TCRBD02 TCRBD02-01 01 TCRBJ02-0201 TCRBJ02 TCRBJ02-02 01 3 2 6 3 1 3 37 51 52 59 61 1 Out VDJ
AGCACCTTGGAGCTGGGGGACTCGGCCCTTTATCTTTGCGCCAGCAGCTCGGGACAGGGTGTGTCCAATGAGCAGTTCTTCGGGCCA CASSSGQGVSNEQFF 1 2.5785959261536543E-6 45 TCRBV05-0101 TCRBV05 TCRBV05-01 01 TCRBD01-0101 TCRBD01 TCRBD01-01 01 TCRBJ02-0101 TCRBJ02 TCRBJ02-01 01 3 6 0 3 1 6 36 49 50 59 65 1 In VDJ
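The two layouts can be told apart by their header names. The sketch below reads either one and yields the sequence and its abundance for in-frame rows. It is not TCRDivER's own parsing code: the function names, the tab-separated input, and the choice to keep only rows marked In are assumptions made for illustration.

```python
import csv
import io

# Column names taken from the two layouts described in this thread.
SEQUENCE_COLUMNS = ("amino_acid", "aminoAcid")
STATUS_COLUMNS = ("frame_type", "sequenceStatus")
ABUNDANCE_COLUMNS = (
    "productive_frequency",
    "count (templates/reads)",
    "count (reads)",
)


def pick_column(fieldnames, candidates):
    """Return the first candidate column name present in the header."""
    for name in candidates:
        if name in fieldnames:
            return name
    raise ValueError(f"none of {candidates} found in header")


def read_in_frame(handle):
    """Yield (sequence, abundance) for in-frame rows of either layout.

    Assumes tab-separated input; keeping only 'In' rows is an
    assumption, not a documented TCRDivER rule.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    seq_col = pick_column(fieldnames, SEQUENCE_COLUMNS)
    status_col = pick_column(fieldnames, STATUS_COLUMNS)
    abundance_col = pick_column(fieldnames, ABUNDANCE_COLUMNS)
    for row in reader:
        if row[status_col] == "In" and row[seq_col]:
            yield row[seq_col], float(row[abundance_col])


# Demo on a tiny in-memory Version 2 table (three columns only).
sample = io.StringIO(
    "aminoAcid\tcount (templates/reads)\tsequenceStatus\n"
    "\t1\tOut\n"
    "CASSSGQGVSNEQFF\t1\tIn\n"
)
print(list(read_in_frame(sample)))
```

Because the columns are looked up by name, their position and any extra columns in the file do not matter.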

sciencisto commented 1 year ago

Hi, both issues have now been addressed. There is a single bash script to run TCRDivER, and there is a folder with input examples.