hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0

Help with creating a cohort #227

Open js2dark opened 7 years ago

js2dark commented 7 years ago

Hello, I'm fairly new to Python and I've been trying to use the cohorts library, mainly to calculate neoantigens in my tumor samples. I have each tumor's processed BAM and VCF files, but I'm having a difficult time combining them into a cohort so that I can proceed to counting neoantigens. If there is a step-by-step manual for creating a cohort, I would greatly appreciate it if you could share it. Thank you, and I hope to hear from you soon.

jburos commented 7 years ago

Hi @js2dark - This is a great use case for cohorts; we are happy to help.

We have some worked examples of how to combine VCFs & clinical data into a Cohort object. One of them uses TCGA data, with some explanatory text, and builds on an earlier example that creates a cohort from clinical data only.

In all of these examples, the basic approach is the same: you loop over the units in your cohort (i.e. patients), creating a Patient object for each one. You then pass this list of Patients to create the Cohort object.

Note that neither of the examples above uses BAMs. To include them, also create a Sample for each of your samples (tumor and/or normal) when creating a Patient, then pass those Sample objects to the Patient.

For example:

normal_sample = Sample(
    is_tumor=False,
    bam_path_dna=bam_path_dna_normal)
tumor_sample = Sample(
    is_tumor=True,
    bam_path_dna=bam_path_dna_tumor,
    bam_path_rna=bam_path_rna_tumor,
    kallisto_path=kallisto_path,
    cufflinks_path=cufflinks_path)

These are then passed to the Patient object when it is instantiated:

patient = Patient(id=patient_id,
    benefit=row["is_benefit"],
    os=row["OS in days"],
    pfs=row[pfs_col],  # Depends on RECIST choice
    deceased=row["is_deceased"],
    progressed=row["is_progressed"],
    progressed_or_deceased=row["is_progressed_or_deceased"],
    hla_alleles=row["hla_allele_list"],
    vcf_paths=snv_vcf_paths,
    normal_sample=normal_sample,  # <- here
    tumor_sample=tumor_sample,     # <- and here
    additional_data=row.to_dict())

NB: these examples are taken from the code we recently used to analyze data from a cohort. I'm including that code here as a possibly more complete example, but beware that it used an earlier version of cohorts, so some options may have changed since then.

Hope this gives you a good starting point. Feel free to get in touch if you run into sticky points or to give feedback on the documentation -- admittedly we need to do more on that front & to make these examples easier to find.
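A minimal sketch of the loop described in this comment, assuming a small clinical table with one row per patient. The column names and file paths are hypothetical placeholders. The keyword names (`id`, `os`, `pfs`, `deceased`, `progressed`, `variants`) are the ones that appear elsewhere in this thread, and the cohorts calls are left as comments so the sketch runs without the library installed.

```python
import csv
import io

# Hypothetical clinical table; the column names are placeholders,
# not a convention required by cohorts.
CLINICAL_CSV = """patient_id,os_days,pfs_days,is_deceased,is_progressed,vcf_path
patient_1,70,24,1,1,/data/patient_1.vcf
patient_2,100,50,1,0,/data/patient_2.vcf
"""


def patient_kwargs(row):
    """Translate one clinical row into keyword arguments for Patient(...)."""
    return {
        "id": row["patient_id"],
        "os": int(row["os_days"]),
        "pfs": int(row["pfs_days"]),
        "deceased": row["is_deceased"] == "1",
        "progressed": row["is_progressed"] == "1",
        "variants": [row["vcf_path"]],  # one VCF per patient in this sketch
    }


def load_patient_kwargs(csv_text):
    """Return one kwargs dict per patient row in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [patient_kwargs(row) for row in reader]


all_kwargs = load_patient_kwargs(CLINICAL_CSV)

# With cohorts installed, the objects could then be built roughly like this:
#   from cohorts import Patient, Cohort
#   patients = [Patient(**kw) for kw in all_kwargs]
#   cohort = Cohort(patients=patients, cache_dir="cache")
```

Keeping the row-to-keyword translation in its own function makes it easy to inspect what each Patient will receive before any cohort is built.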

js2dark commented 7 years ago

Hello Jacki,

Thank you so much for your response and help.

I was able to successfully make Patients and combine them into a Cohort.

When I was making Patients with just clinical features such as OS, PFS, deceased, etc., I had no problem. But when I try to add the VCF path by entering either "snv_vcf_paths=..." or "vcf_paths=...", I get "TypeError: __init__() got an unexpected keyword argument 'snv_vcf_paths'" (or 'vcf_paths').

I'm sorry if these are really basic questions, as I'm still new to Python. Thank you so much for your help.

Sincerely, Jason

jburos commented 7 years ago

@js2dark happy to hear that. Sorry, the error you are seeing is my fault: the syntax changed in the latest version to variants=[vcf_path1,...]
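For scripts written against the older keyword, the rename can be handled in one place. This is a minimal sketch that operates on a plain dictionary of keyword arguments; `migrate_patient_kwargs` is a hypothetical helper, not part of cohorts.

```python
def migrate_patient_kwargs(kwargs):
    """Return a copy of kwargs with the older `vcf_paths` keyword renamed to `variants`."""
    migrated = dict(kwargs)  # leave the caller's dict untouched
    if "vcf_paths" in migrated:
        if "variants" in migrated:
            raise ValueError("pass either vcf_paths or variants, not both")
        paths = migrated.pop("vcf_paths")
        # Accept a single path as well as a list of paths.
        migrated["variants"] = [paths] if isinstance(paths, str) else list(paths)
    return migrated


old_style = {"id": "patient_1", "os": 70, "vcf_paths": ["/data/patient_1.vcf"]}
new_style = migrate_patient_kwargs(old_style)
print(new_style)

# With cohorts installed, this would then be used as:
#   patient = Patient(**new_style)
```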

Apologies.

js2dark commented 7 years ago

Hi Jackie, thank you for your help.

I got the cohort to run and got results, but for neoantigen_count I've been getting "NaN".

The code I'm running looks like this:

import pandas as pd
import numpy as np
import sys
from os import path, getcwd, environ
from cohorts import Sample, Patient, Cohort, DataFrameLoader
from cohorts.variant_stats import variant_stats_from_variant
from cohorts.functions import missense_snv_count, neoantigen_count, snv_count

patient_1 = Patient(id="patient_1",
    variants=["/Users/Balthazars/Desktop/Hypermutation/IRCR_GBM_352_TL_SS.mutect_rerun_filter_vep.vcf"],
    os=70, pfs=24, deceased=True, progressed=True, benefit=False)
patient_2 = Patient(id="patient_2",
    variants=["/Users/Balthazars/Desktop/Hypermutation/IRCR_BT15_847_T02_SS.mutect_pair_filter_vep.vcf"],
    os=100, pfs=50, deceased=True, progressed=False, benefit=True)

print(patient_1)
print(patient_2)

cohort = Cohort(patients=[patient_1, patient_2],
    cache_dir="/Users/Balthazars/Desktop/Hypermutation/Results")
df = cohort.as_dataframe(on=neoantigen_count)

print(df)

df.to_csv(r'/Users/Balthazars/Desktop/Hypermutation/Results/results.csv', index=None, sep=',', mode='a')

Is it due to the absence of HLA alleles in my Patient objects? When I run the code it says "HLA alleles did not exist for patient patient_1", and the same for patient_2. Or is there another required file besides the VCF file?

If it's due to the absence of HLA alleles, is there a built-in function within cohorts for determining HLA alleles?

Thank you so much

Sincerely, Jason

jburos commented 7 years ago

Hi @js2dark / Jason,

This looks great - happy to hear you're getting these results to run, albeit partially. Yes, the predicted neoantigen piece requires data for HLA types on each patient. You would need to infer these from your WES / WGS sequencing, or know them for your patients by some other means.

Unfortunately, predicted neoantigen data do depend on the HLA type data. You would pass this information to the Patient objects as a list of HLA types, much as you did for other features.

Just to be clear, this would look something like the following:


Patient(id="",
    hla_alleles=['A*01:01',
        'A*24:02',
        'B*08:01',
        'B*15:17',
        'C*07:01',
        'C*07:01'],
    ...)
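A small sanity check before building a Patient can catch a bare string being passed where a list is expected. This is only a sketch: `check_hla_alleles` is a hypothetical helper, and the pattern covers just the allele form used in the example above (locus A, B or C followed by two two-digit fields). The library may well accept other spellings, so treat a failure here as a prompt to look closer rather than as proof of an error.

```python
import re

# Matches alleles written like 'A*01:01'. Assumption: class I loci A, B, C
# with two 2-digit fields, as in the example above.
ALLELE_PATTERN = re.compile(r"^[ABC]\*\d{2}:\d{2}$")


def check_hla_alleles(hla_alleles):
    """Return the alleles as a list, or raise if they do not look like the example form."""
    if isinstance(hla_alleles, str):
        raise TypeError("hla_alleles should be a list of strings, not a single string")
    unexpected = [a for a in hla_alleles if not ALLELE_PATTERN.match(a)]
    if unexpected:
        raise ValueError("unexpected allele format: %s" % ", ".join(unexpected))
    return list(hla_alleles)


alleles = check_hla_alleles(
    ['A*01:01', 'A*24:02', 'B*08:01', 'B*15:17', 'C*07:01', 'C*07:01'])
print(alleles)
```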

js2dark commented 7 years ago

Hi Jackie,

I got the HLA type information for the patients that I'm running and annotated them with " hla_alleles='A2' " or " hla_alleles='B2' " as appropriate. I've been using Python 3.6 and have updated all other packages, including mhctools, tensorflow, etc.

But it seems that "base_commandline_predictor.py" under mhctools can't process "from mhcnames.parsing_helpers import AlleleParseError".

I was wondering whether the syntax for this has changed in mhcnames, or whether a different version is required to run it. My mhcnames version is 0.2.1 and mhctools is 1.5.0.

Thank you Sincerely, Jason

-- Jason Kyungha Sa, Ph.D Institute for Refractory Cancer Research Samsung Medical Center

jburos commented 7 years ago

I'm going to see if I can reproduce this error you're getting - will get back to you. Thanks!

jburos commented 7 years ago

If I'm in a new python 3.5.2 session with mhcnames v 1.2.0, I see the same thing you're seeing:

Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mhcnames.parsing_helpers import AlleleParseError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'AlleleParseError'

It looks like in v 1.2.0 this should read:

from mhcnames import AlleleParseError

jburos commented 7 years ago

@js2dark can you send us a traceback from this error you're getting when you have a chance? This will help us determine where in the code this is coming up. Thanks so much!

jburos commented 7 years ago

@js2dark this issue should be fixed in the latest version of cohorts. It was caused by a conflict in the latest version of mhctools & the latest version of mhcnames.

If you do pip install git+git://github.com/hammerlab/cohorts it should be resolved. Thanks for the feedback & please let us know if you continue to run into issues.

js2dark commented 7 years ago

Hi Jackie,

Below is the traceback from the error I got previously,

Traceback (most recent call last):
  File "Neoantigen_cohorts.py", line 4, in <module>
    from cohorts import Sample, Patient, Cohort, DataFrameLoader
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/__init__.py", line 15, in <module>
    from .cohort import Cohort
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 40, in <module>
    from mhctools import NetMHCcons, EpitopeCollection
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/__init__.py", line 12, in <module>
    from .netmhc import NetMHC
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/netmhc.py", line 20, in <module>
    from .netmhc3 import NetMHC3
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/netmhc3.py", line 17, in <module>
    from .base_commandline_predictor import BaseCommandlinePredictor
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/base_commandline_predictor.py", line 24, in <module>
    from mhcnames.parsing_helpers import AlleleParseError
ImportError: cannot import name 'AlleleParseError'

I updated cohorts through the GitHub link that you sent, and also updated mhctools to version 1.6.0 from 0.3.1 and mhcnames to 0.3.0 from 0.1.0. Now I'm getting the following error:

Using TensorFlow backend.
Traceback (most recent call last):
  File "Neoantigen_cohorts.py", line 4, in <module>
    from cohorts import Sample, Patient, Cohort, DataFrameLoader
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/__init__.py", line 15, in <module>
    from .cohort import Cohort
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 41, in <module>
    from mhctools import NetMHCcons, EpitopeCollection
ImportError: cannot import name 'EpitopeCollection'

The version of cohorts is 0.6.4+14.g6926523.

Do I need to use different versions of the above packages, or is there maybe another issue?

Thank you, and I hope to hear from you soon.

Sincerely, Jason

tavinathanson commented 7 years ago

Hey @js2dark,

Apologies for this being a bit confusing, but you'll actually need to use the versions of mhctools and mhcnames that cohorts now requires vs. upgrading to the latest versions of both of them. @jburos recently made a change in cohorts to pin mhcnames to 0.1.0 to solve this automatically.

If you pip install -r requirements.txt in cohorts, does that resolve the issue?
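To confirm that the environment really matches the pins after installing, the installed versions can be compared against the expected ones. A minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the only pin taken from this thread is mhcnames 0.1.0, and anything else should come from the requirements.txt in your checkout. `find_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Pin mentioned in this thread; extend from your copy of requirements.txt.
PINS = {"mhcnames": "0.1.0"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (installed, wanted)} for every package that is off its pin."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches


for package, (installed, wanted) in find_mismatches(PINS).items():
    print("%s: installed %s, expected %s" % (package, installed, wanted))
```

An empty result means every listed package is at its pinned version.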

Tavi

js2dark commented 7 years ago

Hi Tavi,

I ran the commands and fixed the versions to:

'provenance_file_summary': {'cohorts': '0.5.5', 'isovar': '0.7.0', 'mhctools': '0.3.1', 'numpy': '1.13.0', 'pandas': '0.20.3', 'pyensembl': '1.0.3', 'scipy': '0.19.1', 'topiary': '0.1.2', 'varcode': '0.5.15'}}

But I'm getting the following errors:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/base_commandline_predictor.py", line 137, in __init__
    run_command([self.program_name])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/process_helpers.py", line 74, in run_command
    process = AsyncProcess(args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/process_helpers.py", line 47, in __init__
    self.process = Popen(args, stdout=stdout, stderr=stderr)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 1326, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'netMHCcons'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Neoantigen_cohorts.py", line 14, in <module>
    df = cohort.as_dataframe(on=neoantigen_count)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 367, in as_dataframe
    return apply_func(on, func_name(on), df).return_self(return_cols)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 355, in apply_func
    df[col] = df.progress_apply(func, axis=1) ## depends on tqdm on prev line
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tqdm/_tqdm.py", line 530, in inner
    result = getattr(df, df_function)(wrapper, *args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
    ignore_failures=ignore_failures)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
    results[i] = func(v)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tqdm/_tqdm.py", line 526, in wrapper
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 351, in <lambda>
    func = lambda row: on(row=row, cohort=self, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/functions.py", line 41, in wrapper
    **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/functions.py", line 58, in wrapper
    **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/functions.py", line 230, in neoantigen_count
    **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 977, in load_neoantigens
    filter_fn=filter_fn)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/cohorts/cohort.py", line 1012, in _load_single_patient_neoantigens
    process_limit=process_limit)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/netmhc_cons.py", line 41, in __init__
    process_limit=process_limit)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mhctools/base_commandline_predictor.py", line 139, in __init__
    raise SystemError("Failed to run %s" % self.program_name)
SystemError: ('Failed to run netMHCcons', 'occurred at index 0')

Thank you

tavinathanson commented 7 years ago

Hey @js2dark, mhctools and therefore cohorts expects that you have NetMHC* tools (e.g. NetMHCcons) installed; we can't install those for you for license reasons, but the download page is at: www.cbs.dtu.dk/cgi-bin/nph-sw_request?netMHCcons.

You can also configure cohorts to use other tools (via mhctools), including our open source tool, https://github.com/hammerlab/mhcflurry.

Does that help?
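The FileNotFoundError in the traceback above means the `netMHCcons` executable could not be found when mhctools tried to launch it. A quick way to check this before running a cohort is to look the program up on the PATH with the standard library. This is a sketch; `find_predictor` is a hypothetical helper name.

```python
import shutil


def find_predictor(program_name="netMHCcons"):
    """Return the full path of the executable, or None if it is not on the PATH."""
    return shutil.which(program_name)


location = find_predictor()
if location is None:
    print("netMHCcons was not found on the PATH; install it or add its directory to PATH")
else:
    print("netMHCcons found at", location)
```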