Open jburos opened 7 years ago
The semantics of the path arguments to compare (e.g. --optitype) are a bit confusing.
When you supply a path, hlarp still has to have some logic for aggregating the different outputs and grouping them by sample for comparison (for example patient1, patient2, ...). What is happening, as far as I understand, is that it is not grouping them for comparison, so it looks like there are 2 different samples (normal-patient_id_N, tumor-patient_id_T). When it comes time for comparison, OptiType1 (the first invocation) is missing tumor-patient_id_T and therefore it compares against the empty set, leading to nan. Similarly for OptiType2.
The --optitype invocation (like the other tools that hlarp currently supports) takes its sample value from the directory preceding the "timestamped" OptiType output.
For example, with the Upenn data set I have OptiType results that live in:
/some/other/path/120013_TGACCA/2016_05_09_12_12_48/2016_05_09_12_12_48_result.tsv
hlarp looks for something like 2016_05_09_12_12_48_result.tsv and a folder with the same timestamp that contains it. It actually takes the last such folder (i.e. it would ignore 2016_01_01_00_00_00/2016_01_01_00_00_00_result.tsv) and returns the contents of that file (perhaps best not to drag up the terrible memories of why I wanted that). Afterwards it uses the name of the folder that contains the OptiType output folder as the sample name (i.e. 120013_TGACCA).
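To make that convention concrete, here is a minimal Python sketch of the lookup as described above. It is not hlarp's actual code: the function name `find_optitype_result`, the timestamp pattern, and the choice to pass the sample directory itself are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed shape of OptiType's timestamped folder names, e.g. 2016_05_09_12_12_48.
TIMESTAMP = re.compile(r"^\d{4}(_\d{2}){5}$")


def find_optitype_result(sample_dir):
    """Return (sample_name, result_path) for the latest timestamped run, or None.

    Hypothetical helper: the sample name is the directory that contains the
    timestamped OptiType output folders.
    """
    sample_dir = Path(sample_dir)
    # Keep only timestamped folders that contain a matching <timestamp>_result.tsv.
    runs = sorted(
        d for d in sample_dir.iterdir()
        if d.is_dir()
        and TIMESTAMP.match(d.name)
        and (d / f"{d.name}_result.tsv").is_file()
    )
    if not runs:
        return None
    # Zero-padded timestamps sort chronologically, so the last entry is the newest.
    latest = runs[-1]
    return sample_dir.name, latest / f"{latest.name}_result.tsv"
```

With a layout like the one above, calling this on /some/other/path/120013_TGACCA would give the sample name 120013_TGACCA and ignore any older timestamped folder next to the newest one.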
In general, this is a bit of a tricky problem. The most general solution would be to take a regex with special identifiers (e.g. sample_name) to be used in a group so that one could extract samples. Or we can agree on a small set of simple conventions. The main benefit of the current convention (sample name = containing directory) is simplicity of implementation. But I am happy to help implement any other solutions that could make your analysis easier.
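As a rough illustration of the regex idea, the sketch below uses a Python named group called `sample_name`. Nothing like this exists in hlarp today; the pattern is a hypothetical one fitted to the tumor/normal directory names from this thread, and `sample_from_path` is an invented helper.

```python
import re

# Hypothetical user-supplied pattern. The named group `sample_name` marks
# the part of the path that identifies the sample.
PATTERN = re.compile(
    r"(?:tumor|normal)-(?P<sample_name>.+?)_(?:T_SCR|T|N)(?:/|$)"
)


def sample_from_path(path, pattern=PATTERN):
    """Return the text captured by `sample_name`, or None if nothing matches."""
    match = pattern.search(path)
    return match.group("sample_name") if match else None


# Both directories from this thread map to the same sample.
print(sample_from_path("/path/to/tumor-patient_id_T_SCR/2016_11_13_11_31_48"))
print(sample_from_path("/path/to/normal-patient_id_N/2016_11_13_10_00_34"))
```

Both calls print patient_id, which is the grouping the comparison needs. A path that does not match the pattern returns None, so a caller could fall back to the containing-directory convention.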
Thanks again. Agreed, this would be a tricky problem to solve generally.
For my use case, I'm not using hlarp compare to aggregate results across different samples; for that I used multiple. When I want to compare, it would be nice for me to say --compare-all, which would essentially ignore sample identifiers & just compare anything I add to the command line. That might be one way to get around the need for a regex / etc.
A mini hack I tried in the meantime was to generate the hlarp report output, apply a regex to the source descriptions & pass this as hlarp compare --hlarp-file. In this case, it also didn't compute the Jaccard similarities... Perhaps I am not using this command correctly?
Here is the output:
## generate filtered hlarp report
$ hlarp multiple --optitype /path/to/tumor-patient_id_T_SCR/2016_11_13_11_31_48 --optitype /path/to/normal-patient_id_N/2016_11_13_10_00_34 | perl -p -e 's/OptiType_.*-(.*)_(T_SCR|N)/Optitype-$1/g' > hlarp_output
$ cat hlarp_output
class,allele,qualifier,confidence,run
1,A*02:01,,,Optitype-patient_id
1,A*02:07,,,Optitype-patient_id
1,B*07:02,,,Optitype-patient_id
1,B*44:03,,,Optitype-patient_id
1,C*07:02,,,Optitype-patient_id
1,C*14:03,,,Optitype-patient_id
1,A*02:01,,,Optitype-patient_id
1,A*02:01,,,Optitype-patient_id
1,B*07:02,,,Optitype-patient_id
1,B*44:03,,,Optitype-patient_id
1,C*07:02,,,Optitype-patient_id
1,C*14:03,,,Optitype-patient_id
## hlarp compare on reformatted output
$ hlarp compare --hlarp-file hlarp_output
Optitype-patient_id jaccard similarity: -nan
A*02:01 hlarp_outputx3
A*02:07 hlarp_output
B*07:02 hlarp_outputx2
B*44:03 hlarp_outputx2
C*07:02 hlarp_outputx2
C*14:03 hlarp_outputx2
Average jaccard similarity across runs: -nan
So I gathered from this that I had to pipe the various optitype outputs to different hlarp-files, each using the same sample identifiers to get the summary I wanted. Curious to know if this seems reasonable & if I'm barking up the right tree?
Thanks again --
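For reference, the renaming step could also be done on the run column alone rather than on whole lines. This is only a sketch under assumptions: `rename_runs` is a made-up helper, and the run name in the usage comment is inferred from the perl regex above rather than copied from real hlarp output.

```python
import csv
import io
import re


def rename_runs(report_text, pattern, replacement):
    """Rewrite only the `run` column of an hlarp-style CSV report.

    Hypothetical helper: `pattern` and `replacement` are passed to re.sub.
    """
    reader = csv.DictReader(io.StringIO(report_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Leave class, allele, qualifier and confidence untouched.
        row["run"] = re.sub(pattern, replacement, row["run"])
        writer.writerow(row)
    return out.getvalue()


# Assumed run name; the real prefix written by `multiple` may differ.
report = "class,allele,qualifier,confidence,run\n1,A*02:01,,,OptiType_tumor-patient_id_T_SCR\n"
print(rename_runs(report, r"^OptiType_.*-(.*)_(T_SCR|N)$", r"\1"))
```

Applying this to each report separately and writing each result to its own file gives several hlarp-files that share one sample identifier.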
So I gathered from this that I had to pipe the various optitype outputs to different hlarp-files, each using the same sample identifiers to get the summary I wanted. Curious to know if this seems reasonable & if I'm barking up the right tree?
It is reasonable. compare is meant to look for similarity across all the arguments (you need more than one argument, otherwise you'll get nan output) and then across samples within those arguments. For hlarp-files the samples are in the last column, called "run". So if you want to rename output, it would be better to rename the runs to the same patient id. You also don't need multiple in this case, since it is the one adding "OptiType" to the sample name (i.e. the run column). You could just use hlarp optitype /path instead.
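One way to see why a single argument ends in nan is the Python sketch below. It assumes the comparison averages Jaccard similarity over pairs of allele sets, one set per argument; that is a guess at the behaviour described here, not hlarp's actual implementation, and both function names are invented.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; nan when both sets are empty (0/0)."""
    union = a | b
    return len(a & b) / len(union) if union else math.nan


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of runs for one sample."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        # Fewer than two runs: there is nothing to compare, hence nan.
        return math.nan
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# The two allele lists from the report above, treated as sets.
run_a = {"A*02:01", "A*02:07", "B*07:02", "B*44:03", "C*07:02", "C*14:03"}
run_b = {"A*02:01", "B*07:02", "B*44:03", "C*07:02", "C*14:03"}

print(mean_pairwise_jaccard([run_a]))         # nan: only one argument
print(mean_pairwise_jaccard([run_a, run_b]))  # 5 shared alleles out of 6, about 0.83
```

Under this reading, passing one hlarp-file leaves no pair to average over, while two files that use the same run name give a defined similarity.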
@rleonid apologies, I missed your ping about this yesterday. For now I am piping various outputs to compare using hlarp-files as described above. I ended up using multiple (rather than optitype, for example) so that I could include results from multiple samples in a single command.
So at this point the issue is a nice-to-have rather than a requirement for use. Have a nice holiday & happy to discuss if you like.
Ok, let's discuss in person when we're both back in the lab. It sounds like you have the HLA-types that you need.
When I use hlarp to compare HLA types from different samples for the same patient, I get output for each sample name separately. Is there a way to force hlarp to ignore the sample description in the parsing of output? For example: