mbakeranalecta / sam

Semantic Authoring Markdown

Unannotated phrase found #163

Closed sfh1111 closed 6 years ago

sfh1111 commented 6 years ago

Hi, I have CCS data from PacBio sequencing. I aligned it using bwa and created a SAM file. When I try to convert it to XML using samparser, I get:

 SAM parser warning: Unannotated phrase found: {~MkvUDEi$WXh~r>~~<l{q9dsK\zV^L*>HeihGjEPv?~HyUmc)`~WfTxITcv~1lwii{-arAoR[t~~;n~Zio~UMS~~3gf@f_~~M~\kmZZpzq.cTlh~~'^l~~QjuM`mkb\iza3Xgj_ld'3S[(yW9t~&i\Ym~6i[A$%Dg*e~UmiWv~s@S} If you are trying to insert curly braces into the document, use \{~MkvUDEi$WXh~r>~~<l{q9dsK\zV^L*>HeihGjEPv?~HyUmc)`~WfTxITcv~1lwii{-arAoR[t~~;n~Zio~UMS~~3gf@f_~~M~\kmZZpzq.cTlh~~'^l~~QjuM`mkb\iza3Xgj_ld'3S[(yW9t~&i\Ym~6i[A$%Dg*e~UmiWv~s@S}.
 SAM parser ERROR: Structure error: Unrecognized character entity found: &bCSIKHKSO; at line 389:

Any ideas why? Thanks

mbakeranalecta commented 6 years ago

SAM uses curly braces to annotate phrases in a text. Annotations look like this:

{John Wayne}(actor) stars in {Rio Bravo}(movie)

The curly braces denote the phrase and the parens contain the annotation. If you mark up a phrase but don't provide an annotation, the parser issues the unannotated phrase warning. Your data contains a string between curly braces that the parser interprets as a phrase.

SAM is UTF-8 by definition, but it also allows special characters to be represented using HTML character entities. &bCSIKHKSO; has the format of a character entity but is not one of the entities defined in HTML, so the parser reports the error.
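
If you want to find strings like this in your data before running the parser, here is a rough sketch in Python. It assumes the relevant entity set matches the HTML5 named entities in Python's standard `html.entities` table, plus numeric references; the function name `is_html_entity` is made up for this illustration and is not part of the SAM parser, which may check entities differently.

```python
from html.entities import html5  # maps names such as "amp;" to characters


def is_html_entity(token: str) -> bool:
    """Return True if token looks like a valid HTML character reference.

    Hypothetical helper: an approximation of the check, not the
    SAM parser's own implementation.
    """
    if not (token.startswith("&") and token.endswith(";")):
        return False
    body = token[1:-1]
    if body.startswith("#"):
        # Numeric reference: decimal (&#169;) or hexadecimal (&#xA9;).
        digits = body[1:]
        if digits[:1] in ("x", "X"):
            hex_digits = digits[1:]
            return hex_digits != "" and all(
                c in "0123456789abcdefABCDEF" for c in hex_digits
            )
        return digits.isascii() and digits.isdigit()
    # Named reference: the html5 table stores keys with the trailing ";".
    return (body + ";") in html5


print(is_html_entity("&amp;"))
print(is_html_entity("&bCSIKHKSO;"))
```

The first call should report a valid entity and the second should not, which is the distinction behind the structure error above.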

Both the warning and the error can be avoided by prefixing the markup characters with a backslash: \{ and \&.

Note that SAM is designed as an authoring format, chiefly for prose. As such, it has a lot of contextual markup. A pipe character, for instance, can be plain text or three different kinds of markup depending on context. There are a number of characters that can be interpreted as markup in context, including :|*_&\ and backtick. If you are putting data into a SAM paragraph, you may need to escape all of these characters with backslashes.
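
If you do go the escaping route, something along these lines might work as a pre-processing step. This is only a sketch: the character set is the one listed above plus the curly braces from your warning, and I am assuming it is safe to escape every occurrence regardless of context. `SAM_MARKUP_CHARS` and `escape_for_sam` are names invented for this example, not part of the SAM tooling.

```python
# Characters that may be read as markup in a SAM paragraph. This set is an
# assumption built from the characters named in this thread, not an
# authoritative list from the SAM specification.
SAM_MARKUP_CHARS = set("\\{}:|*_&`")


def escape_for_sam(text: str) -> str:
    """Prefix every potential markup character with a backslash."""
    return "".join(
        "\\" + ch if ch in SAM_MARKUP_CHARS else ch
        for ch in text
    )


print(escape_for_sam("{~Mkv&b}"))
```

Escaping everything is the blunt approach; it gives up the contextual rules in exchange for not having to reason about them.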

However, SAM does have a structure for passing through raw data without escaping. It is called the embed block and it looks like this:

 ```(=ccs)
      {~MkvUDEi$WXh~r>~~<l{q9dsK\zV^L*>HeihGjEPv?~HyUmc)`~WfTxITcv~1lwii{arAoR[t~~;n~Zio~UMS~~3gf@f_~~M~\kmZZpzq.cTlh~~'^l~~QjuM`mkb\iza3Xgj_ld'3S[(yW9t~&i\Ym~6i[A$%Dg*e~UmiWv~s@S}

It is the use of the = sign before the encoding name that tells SAM that this is an embedded encoding (to be interpreted) rather than a codeblock (to be displayed).
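
If you are generating the SAM document from a script, wrapping the raw data could look roughly like this. It is a sketch under assumptions: the four-space indent, the `ccs` encoding name, and the function name `to_embed_block` are all illustrative choices, and I have not considered how blank lines inside the data should be handled.

```python
def to_embed_block(data: str, encoding: str = "ccs", indent: int = 4) -> str:
    """Wrap raw data in a SAM embed block header with an indented body.

    Hypothetical helper: the layout follows the example above, where the
    "=" before the encoding name marks the block as embedded data.
    """
    fence = "`" * 3  # three backticks open the block
    pad = " " * indent
    lines = [f"{fence}(={encoding})"]
    # Indent every data line so it is read as the body of the block.
    lines.extend(pad + line for line in data.splitlines())
    return "\n".join(lines) + "\n"


print(to_embed_block("{~MkvUDEi$WXh~r>"))
```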

I'm not sure if that will be useful to you or not, based on your use case.