kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

import_adaptive_file function drops subject data #42

Closed grhogg closed 3 years ago

grhogg commented 3 years ago

Hello, I am running into an issue when converting an exported Adaptive immunoSEQ data set: the Sample Name column (e.g. 1000-Tissue_TCRB) is dropped upon conversion, and the output subject column contains only the file name instead.

import pandas as pd
from tcrdist.repertoire import TCRrep
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt
df = import_adaptive_file(adaptive_filename = "Tissue_Bioidentity_TCRdist.tsv")
pd.options.display.max_colwidth = 100
df

Any thoughts on how to troubleshoot would be greatly appreciated. Thanks!

kmayerb commented 3 years ago

To Address issue #42

We originally designed this function with one subject per file in mind.

If I understand correctly, the file Tissue_Bioidentity_TCRdist.tsv has multiple subject IDs that you want to retain.

Once all the tests pass and this PR has been merged into master, you can reinstall with:

pip install git+https://github.com/kmayerb/tcrdist3.git@master

Then you can add 'subject' to the new argument:

use_cols : list
    ['bio_identity', 'productive_frequency', 'templates', 'rearrangement', 'subject']
    List of columns to retain from the original input file. Add 'subject' if you wish to retain the subject, or omit it to use the filename as before.
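
After reinstalling, one quick way to confirm that the installed version exposes the new argument is to inspect the function signature. The following is a minimal sketch, not part of tcrdist3: `accepts_argument` is a hypothetical helper, and `example_import` is a stand-in that only mirrors the argument names used in this thread.

```python
import inspect


def accepts_argument(func, name):
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


# Hypothetical stand-in that mirrors the argument names from this thread.
# It is NOT the real tcrdist3 function.
def example_import(adaptive_filename, use_cols=None):
    return adaptive_filename, use_cols


has_use_cols = accepts_argument(example_import, "use_cols")
has_other = accepts_argument(example_import, "not_an_argument")
print(has_use_cols, has_other)
```

With tcrdist3 installed, you would pass `import_adaptive_file` (imported from `tcrdist.adpt_funcs`) in place of the stand-in. If the check comes back False, the old version is probably still the one being imported.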

grhogg commented 3 years ago

Still no luck on my end.

pip uninstall tcrdist3
pip install git+https://github.com/kmayerb/tcrdist3.git
temp = pd.read_csv("Tissue_Bioidentity_TCRdist.tsv", sep='\t')
temp = temp.rename(columns={"sample_name": "subject"})
temp.head()
filenameTSV = 'Tissue_Bioidentity_TCRdist.tsv'
with open(filenameTSV,'w') as write_tsv:
    write_tsv.write(temp.to_csv(sep='\t', index=False))

from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt

df = import_adaptive_file(adaptive_filename = "Tissue_Bioidentity_TCRdist.tsv", use_cols = ['bio_identity', 'productive_frequency', 'templates', 'rearrangement', 'subject'])

pd.options.display.max_colwidth = 100
df

Let me know if I'm making some dumb mistake. Thanks!

kmayerb commented 3 years ago

Well, the fact that you successfully passed use_cols = ['bio_identity', 'productive_frequency', 'templates', 'rearrangement', 'subject'] suggests that you reinstalled correctly.

I think the issue might be that you aren't writing the file out with the column subject.
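
One way to confirm which columns are actually in the file on disk is to read only its header row. This is a minimal sketch using the standard library; `tsv_header` is a hypothetical helper, and the demo file and its contents are made up for illustration.

```python
import csv
import os
import tempfile


def tsv_header(path):
    """Return the column names from the first row of a tab-separated file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


# Demo on a tiny made-up file; the values are placeholders, not real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["subject", "bio_identity", "templates"])
        writer.writerow(["1000-Tissue_TCRB", "placeholder", "1"])
    demo_header = tsv_header(demo_path)

print("subject" in demo_header)
```

Running `tsv_header` on the file you pass to import_adaptive_file shows whether the header says subject or still says sample_name.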

Try this; see line 4, which uses a new filename:

temp = pd.read_csv("Tissue_Bioidentity_TCRdist.tsv", sep='\t')
temp = temp.rename(columns={"sample_name": "subject"})
temp.head()
temp.to_csv("Tissue_Bioidentity_TCRdist_with_subject.tsv", sep = "\t", index = False)

Then reload that file, not the original:

from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt
df = import_adaptive_file(adaptive_filename = "Tissue_Bioidentity_TCRdist_with_subject.tsv", use_cols = ['bio_identity', 'productive_frequency', 'templates', 'rearrangement', 'subject'])

grhogg commented 3 years ago

Hmmm, I don't think this is the issue. I had previously just overwritten the original to contain the column header "subject", but creating a separate file named "Tissue_Bioidentity_TCRdist_with_subject.tsv" doesn't seem to solve the problem either. I apologize that I'm getting so caught up on a minor import function.

temp = pd.read_csv("Tissue_Bioidentity_TCRdist.tsv", sep='\t')
temp = temp.rename(columns={"sample_name": "subject"})
temp.head()
temp.to_csv("Tissue_Bioidentity_TCRdist_with_subject.tsv", sep = "\t", index = False)
from tcrdist.adpt_funcs import import_adaptive_file, adaptive_to_imgt

df = import_adaptive_file(adaptive_filename = "Tissue_Bioidentity_TCRdist_with_subject.tsv", use_cols = ['bio_identity', 'productive_frequency', 'templates', 'rearrangement', 'subject'])

pd.options.display.max_colwidth = 100
df