nmdp-bioinformatics / hml

Histoimmunogenetics Markup Language (HML).
GNU Lesser General Public License v3.0

Use containers to group elements of the same type together #44

Open gyorgy-horvath-omixon opened 8 years ago

gyorgy-horvath-omixon commented 8 years ago

In several places in the design, a list of elements of the same type is mixed in with elements of other types:

The proposal is to use container elements for these lists, so that elements with different functions are clearly separated:

We could also use this design to separate variant elements that mark mismatches or novelties from variant elements that report missing or undefined reference regions. The suggestion is to use the following container elements:

bmilius-nmdp commented 8 years ago

Thanks for the suggestion! Could you give an example (pseudocode is fine) showing how this would look, and describe a use case where doing this would make reporting easier and/or more robust/accurate?

bmilius-nmdp commented 8 years ago

Gyorgy, are you suggesting changing from this:

    <consensus-sequence-block reference-sequence-id=...>
        <sequence>...</sequence>
        <variant start=... end=... reference-bases=... alternate-bases=...>
            <variant-effect term=... />
        </variant>
        ...
        <variant start=... end=... reference-bases=... alternate-bases=...>
            <variant-effect term=... />
        </variant>
        <sequence-quality sequence-start=... sequence-end=... quality-score=... />
        ...
        <sequence-quality sequence-start=... sequence-end=... quality-score=... />
    </consensus-sequence-block>

to something like this:

    <consensus-sequence-block-list>
        <consensus-sequence-block reference-sequence-id=...>
            <sequence>...</sequence>
            <variant-list>
                <variant start=... end=... reference-bases=... alternate-bases=...>
                    <variant-effect term=... />
                </variant>
                ...
                <variant start=... end=... reference-bases=... alternate-bases=...>
                    <variant-effect term=... />
                </variant>
            </variant-list>
            <sequence-quality-list>
                <sequence-quality sequence-start=... sequence-end=... quality-score=... />
                ...
                <sequence-quality sequence-start=... sequence-end=... quality-score=... />
            </sequence-quality-list>
        </consensus-sequence-block>
    </consensus-sequence-block-list>

I'm still unclear on how the reference-extension-variant-list would be used. What does it do that the variant-list can't do?

gyorgy-horvath-omixon commented 8 years ago

The reference-extension-variant-list would include those variants that are generated to fill in the gaps in a partial reference, e.g. the introns between the exons for many alleles. These are not 'real' variants, since there is no change to the existing reference: no mismatch, no novelty. They are only there to report that the consensus sequence contains more information than the original reference.
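
A rough sketch of how the two lists might sit side by side, assuming the existing variant element is reused unchanged inside both containers. The container names come from this thread; the sequence, coordinates, and bases are made-up placeholders, and the empty reference-bases value is only one possible way to express "nothing in the reference here".

```xml
<!-- Hypothetical sketch: all attribute values below are illustrative placeholders. -->
<consensus-sequence-block reference-sequence-id="ref0">
    <sequence>ACGCACGTAC</sequence>
    <!-- 'Real' differences from the reference: mismatches and novelties -->
    <variant-list>
        <variant start="3" end="4" reference-bases="T" alternate-bases="C"/>
    </variant-list>
    <!-- Consensus bases covering a region the partial reference does not define,
         e.g. an intron between two exons -->
    <reference-extension-variant-list>
        <variant start="6" end="10" reference-bases="" alternate-bases="GTAC"/>
    </reference-extension-variant-list>
</consensus-sequence-block>
```

With this layout, a consumer interested only in mismatches could read variant-list and ignore the extension list entirely.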

Putting these into a separate tag would open up room for further differentiation. For example, when reporting an intron sequence that was not present in the reference, the 'reference-bases' attribute is not really meaningful, or at least does not have the same semantics as when reporting 'real' variations, so it could be omitted. We could also change the attribute and tag names according to this logic.
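
As a minimal sketch of that further step, an entry in the extension list might simply drop the attribute. This assumes the schema would allow reference-bases to be absent inside this container, which is not settled here; the coordinates and bases are placeholders, and no renamed tags or attributes are shown since none have been proposed yet.

```xml
<!-- Hypothetical sketch: values are placeholders, and omitting reference-bases
     assumes a schema change that has not been defined. -->
<reference-extension-variant-list>
    <!-- No reference-bases: the reference defines nothing for this region -->
    <variant start="6" end="10" alternate-bases="GTAC"/>
</reference-extension-variant-list>
```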

This might be a longer process to define properly. The suggested separation is only the first step: to stop mixing variants with completely different semantics into a single list, where some report mismatches and real differences compared to the reference and others report extensions to the reference that do not necessarily cause a mismatch. If you would like all of these kinds of changes included in this single proposal, and not only the separation by these lists, please let me know and I'll put together an example for a more comprehensive change.