Use hlarp to compare HLA type from different samples

jburos commented 7 years ago

When I use hlarp to compare HLA type from different samples for the same patient, I get output for each sample name separately. Is there a way to force hlarp to ignore the sample description in the parsing of output?

For example:

$ hlarp compare --optitype /path/to/normal/optitype/output --optitype path/to/tumor/optitype/output
normal-patient_id_N jaccard similarity: -nan
    A*02:01 OptiType2
    A*29:02 OptiType2
    B*44:02 OptiType2
    B*44:03 OptiType2
    C*05:01 OptiType2
    C*16:01 OptiType2
tumor-patient_id_T  jaccard similarity: -nan
    A*02:01 OptiType1
    A*29:02 OptiType1
    B*44:02 OptiType1
    B*44:03 OptiType1
    C*05:01 OptiType1
    C*16:01 OptiType1
Average jaccard similarity across runs: -nan

rleonid commented 7 years ago

The semantics of the path arguments to compare (ex, --optitype) are a bit confusing.

When you supply a path, hlarp still has to have some logic for aggregating the different outputs and grouping them by "sample" for comparison (for example patient1, patient2, ...). What is happening, as far as I understand, is that it is not grouping them for comparison, so it looks like there are 2 different samples (normal-patient_id_N, tumor-patient_id_T). When it comes time for comparison, OptiType1 (the first invocation) is missing tumor-patient_id_T and therefore it compares against the empty set, leading to nan. Similarly for OptiType2.

The --optitype invocation (similarly to the other tools that hlarp currently supports) takes its sample value from the directory preceding the "timestamped" OptiType output.

For example, with the Upenn data set I have OptiType results that live in: /some/other/path/120013_TGACCA/2016_05_09_12_12_48/2016_05_09_12_12_48_result.tsv

hlarp looks for something like 2016_05_09_12_12_48_result.tsv, and a folder with the same timestamp that contains it. It actually takes the last folder (i.e. it would ignore 2016_01_01_00_00_00/2016_01_01_00_00_00_result.tsv) and returns the contents of that file (perhaps best not to drag up the terrible memories of why I wanted that). Afterwards it uses the name of the folder that contains the OptiType output folder as the sample name (i.e. 120013_TGACCA).
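The directory convention described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not hlarp's actual code: the function name `find_optitype_results`, the timestamp pattern, and the reading of "last folder" as "latest timestamp" are all assumptions.

```python
import re
from pathlib import Path

# Assumed shape of OptiType's timestamped names, e.g. 2016_05_09_12_12_48.
TIMESTAMP = r"\d{4}(?:_\d{2}){5}"


def find_optitype_results(root):
    """Map sample name -> result file (hypothetical helper, not part of hlarp)."""
    latest = {}
    for tsv in Path(root).rglob("*_result.tsv"):
        stamp = tsv.name[: -len("_result.tsv")]
        run_dir = tsv.parent
        # The result file must sit in a folder carrying the same timestamp.
        if not re.fullmatch(TIMESTAMP, stamp) or run_dir.name != stamp:
            continue
        # Sample name = the directory that contains the timestamped folder.
        sample = run_dir.parent.name
        # Zero-padded timestamps sort chronologically as strings; keep the last.
        if sample not in latest or stamp > latest[sample][0]:
            latest[sample] = (stamp, tsv)
    return {sample: path for sample, (_, path) in latest.items()}
```

Under this rule, two OptiType runs share a sample name only if their timestamped folders share a parent directory, which is why normal-patient_id_N and tumor-patient_id_T end up as separate samples.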

In general, this is a bit of a tricky problem. The most general solution would be to take a regex with special identifiers (ex. sample_name) to be used in a group so that one could extract samples. Or we can agree on a small set of simple conventions. The current convention's (sample name = containing directory) main benefit is simplicity of implementation. But I am happy to help implement any other solutions that could make your analysis easier.
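To make the regex idea concrete, here is a minimal Python sketch of extracting a sample through a named group. hlarp does not offer this option as far as this thread shows; the pattern, the group name `sample_name`, and the helper `sample_from_path` are hypothetical and tailored to the directory names that appear in this issue.

```python
import re

# Hypothetical user-supplied pattern: the named group marks the sample, and the
# normal/tumor prefix and _N / _T / _T_SCR suffix are deliberately left outside it.
PATTERN = re.compile(r"(?:normal|tumor)-(?P<sample_name>.+?)_(?:N|T(?:_SCR)?)(?:/|$)")


def sample_from_path(path, pattern=PATTERN):
    """Return the text captured by the sample_name group, or None if no match."""
    match = pattern.search(path)
    return match.group("sample_name") if match else None


# Both directories from this thread would then map to the same sample.
print(sample_from_path("/path/to/normal-patient_id_N/2016_11_13_10_00_34"))
print(sample_from_path("/path/to/tumor-patient_id_T_SCR/2016_11_13_11_31_48"))
```

The trade-off is the one noted above: a pattern like this is flexible, but each user has to write and debug their own, whereas the containing-directory convention needs no configuration.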

jburos commented 7 years ago

Thanks again. Agreed, this would be a tricky problem to solve generally.

For my use case, I'm not using hlarp compare to aggregate results across different samples -- for that use I used multiple. When I want to compare, it would be nice for me to say --compare-all which would essentially ignore sample identifiers & just compare anything I add to the command line. That might be one way to get around the need for a regex / etc.

A mini hack I tried in the meantime was to generate the hlarp report output, apply a regex to the source descriptions & pass this as hlarp compare --hlarp-file. In this case, it also didn't compute the jaccard similarities... Perhaps I am not using this command correctly?

Here is the output:

## generate filtered hlarp report
$ hlarp multiple --optitype /path/to/tumor-patient_id_T_SCR/2016_11_13_11_31_48 --optitype /path/to/normal-patient_id_N/2016_11_13_10_00_34 | perl -p -e 's/OptiType_.*-(.*)_(T_SCR|N)/Optitype-$1/g' > hlarp_output
$ cat hlarp_output
class,allele,qualifier,confidence,run
1,A*02:01,,,Optitype-patient_id
1,A*02:07,,,Optitype-patient_id
1,B*07:02,,,Optitype-patient_id
1,B*44:03,,,Optitype-patient_id
1,C*07:02,,,Optitype-patient_id
1,C*14:03,,,Optitype-patient_id
1,A*02:01,,,Optitype-patient_id
1,A*02:01,,,Optitype-patient_id
1,B*07:02,,,Optitype-patient_id
1,B*44:03,,,Optitype-patient_id
1,C*07:02,,,Optitype-patient_id
1,C*14:03,,,Optitype-patient_id

## hlarp compare on reformatted output
$ hlarp compare --hlarp-file hlarp_output
Optitype-patient_id jaccard similarity: -nan
    A*02:01 hlarp_outputx3
    A*02:07 hlarp_output
    B*07:02 hlarp_outputx2
    B*44:03 hlarp_outputx2
    C*07:02 hlarp_outputx2
    C*14:03 hlarp_outputx2
Average jaccard similarity across runs: -nan

So I gathered from this that I had to pipe the various optitype outputs to different hlarp-files, each using the same sample identifiers to get the summary I wanted. Curious to know if this seems reasonable & if I'm barking up the right tree?

Thanks again --

rleonid commented 7 years ago

> So I gathered from this that I had to pipe the various optitype outputs to different hlarp-files, each using the same sample identifiers to get the summary I wanted. Curious to know if this seems reasonable & if I'm barking up the right tree?

It is reasonable. compare is meant to look for similarity across all the arguments (you need more than one argument, otherwise you'll get nan output) and then samples within those arguments. For hlarp-files the samples are in the last column, called "run". So if you want to rename the output, it would be better to rename the runs to the same patient id. You also don't need multiple in this case, since it is the one adding "OptiType" to the sample name (i.e. the run column). You could just use hlarp optitype /path instead.
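A rough Python sketch of why a single argument yields nan, and why two hlarp-files sharing a run name do not. It assumes the similarity is averaged over pairs of inputs that contain the sample; hlarp's real computation may differ, and the function names here are made up. The rows are the ones shown earlier in this thread, split into two files purely for illustration.

```python
import csv
import io
import math
from itertools import combinations


def alleles_by_run(hlarp_csv_text):
    """Group alleles by the 'run' column of hlarp-file style CSV text."""
    groups = {}
    for row in csv.DictReader(io.StringIO(hlarp_csv_text)):
        groups.setdefault(row["run"], set()).add(row["allele"])
    return groups


def jaccard(a, b):
    """|a & b| / |a | b|, undefined (nan) when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else math.nan


def mean_pairwise_jaccard(files, sample):
    """Average Jaccard over every pair of inputs that contain `sample`."""
    sets = [f[sample] for f in files if sample in f]
    pairs = list(combinations(sets, 2))
    if not pairs:  # a single input leaves nothing to compare against
        return math.nan
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


HEADER = "class,allele,qualifier,confidence,run\n"
first = alleles_by_run(HEADER + "".join(
    f"1,{a},,,patient_id\n"
    for a in ["A*02:01", "A*02:07", "B*07:02", "B*44:03", "C*07:02", "C*14:03"]))
second = alleles_by_run(HEADER + "".join(
    f"1,{a},,,patient_id\n"
    for a in ["A*02:01", "A*02:01", "B*07:02", "B*44:03", "C*07:02", "C*14:03"]))

print(mean_pairwise_jaccard([first], "patient_id"))          # nan: one input only
print(mean_pairwise_jaccard([first, second], "patient_id"))  # 5 shared of 6 alleles
```

With one file per OptiType run and the same value in the run column, each sample has a pair of allele sets to compare, which is the setup suggested above.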

jburos commented 7 years ago

@rleonid apologies I missed your ping about this yesterday - for now I am piping various outputs to compare using hlarp-files as described above. I ended up using multiple (rather than optitype, for example) so that I could include results from multiple samples in a single command.

So at this point the issue is a nice-to-have rather than a requirement for use. Have a nice holiday & happy to discuss if you like.

rleonid commented 7 years ago

Ok, let's discuss in person when we're both back in the lab. It sounds like you have the HLA-types that you need.

hammerlab / hlarp

Use hlarp to compare HLA type from different samples #24