Recommendations for a faster document preprocessing

jbjorne / TEES

Turku Event Extraction System

Recommendations for a faster document preprocessing #5

Closed ajjimeno closed 11 years ago

ajjimeno commented 11 years ago

I am processing MEDLINE and full text from PubMed Central. I am using the classification program with the default preprocessing offered by TEES, but it runs very slowly. Do you have any recommendations for faster preprocessing?

Thank you in advance, Antonio

jbjorne commented 11 years ago

Hi Antonio,

In general, try to keep your batch size (the number of documents in a single interaction XML file) reasonably large while still fitting in memory, so that you save on the preprocessing tool startup times. Just to be sure, you could also check that only sentences with detected BANNER entities are parsed, although this should be the default setting.
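The batching advice can be sketched in a few lines of Python. This is a minimal illustration and not part of TEES: it assumes the documents are already available as `document` elements and that a batch file uses a `corpus` root element. The function name `batch_documents` and the batch size are hypothetical, and a suitable size depends on available memory.

```python
import xml.etree.ElementTree as ET


def batch_documents(documents, batch_size):
    """Yield <corpus> elements, each holding up to batch_size documents."""
    for start in range(0, len(documents), batch_size):
        corpus = ET.Element("corpus")  # assumed root element name
        corpus.extend(documents[start:start + batch_size])
        yield corpus


# Hypothetical usage: five placeholder documents, two per batch file.
docs = [ET.Element("document", id="d%d" % i) for i in range(5)]
for number, corpus in enumerate(batch_documents(docs, 2)):
    # Each batch could be written out with ET.ElementTree(corpus).write(...)
    print(number, len(corpus))
```

Larger batches mean fewer tool startups per document, at the cost of more memory per run.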

In the EVEX project (http://evexdb.org), all of PubMed and PubMed Central have already been processed with TEES, including the preprocessing. Even if your event extraction target differs from those available in that data, you should be able to reuse the preprocessing data, which is available in MySQL format.

Best Regards, Jari

ajjimeno commented 11 years ago

Hi,

Thank you for the fast answer. I downloaded the files and tried to classify them. I ran the classification without the preprocessing, with both the GE11 and the GE13 models, using the command below, and I get the exception shown after the command. The only change I made was to Utils/InteractionXML/CorpusElements.py, since the document element had the attribute pmid instead of the id attribute expected by the program. Do you know where the problem could be coming from?

Thank you in advance, Antonio

python /usr/share/TEES/vpreprocessed/classify.py -i 0XX/099/medline09n0099-part-00000-parsed.xml.gz -o 0XX/099/medline09n0099-part-00000-parsed.xml.gz-tees -m GE11 --omitSteps PREPROCESS

Traceback (most recent call last):
  File "/usr/share/TEES/vpreprocessed/classify.py", line 190, in <module>
    preprocessorParams=options.preprocessorParams, bioNLPSTParams=options.bioNLPSTParams)
  File "/usr/share/TEES/vpreprocessed/classify.py", line 78, in classify
    detector.classify(classifyInput, model, output, goldData=goldInput, fromStep=detectorSteps["CLASSIFY"], omitSteps=omitDetectorSteps["CLASSIFY"], workDir=workDir)
  File "/usr/share/TEES/vpreprocessed/Detectors/EventDetector.py", line 349, in classify
    EvaluateInteractionXML.run(self.edgeDetector.evaluator, xml, self.classifyData, edgeParse)
  File "/usr/share/TEES/vpreprocessed/Evaluators/EvaluateInteractionXML.py", line 370, in run
    return processCorpora(EvaluatorClass, predictedCorpusElements, goldCorpusElements, target, classSets, negativeClassId, entityMatchFunction)
  File "/usr/share/TEES/vpreprocessed/Evaluators/EvaluateInteractionXML.py", line 323, in processCorpora
    print evaluator.toStringConcise(title="Entities")
  File "/usr/share/TEES/vpreprocessed/Evaluators/AveragingMultiClassEvaluator.py", line 346, in toStringConcise
    string += self.classSet.getName(cls)
TypeError: cannot concatenate 'str' and 'NoneType' objects

jbjorne commented 11 years ago

Hi Antonio,

If any of your document elements do not have an id-attribute (unique on the level of that XML-file) the file is not valid interaction XML, and all kinds of things can go wrong with the processing. The EVEX data release XML format may not be quite the same as the TEES interaction XML, but differences should be minimal, so it should be pretty easy to update the files for compatibility. Please refer to the example on page https://github.com/jbjorne/TEES/wiki/Interaction-XML for the naming of the XML attributes and elements for use with the current version of TEES.
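A quick way to test the id requirement before running TEES is a small standalone check such as the one below. It is a sketch that uses only Python's standard `xml.etree.ElementTree` module. The function name `check_document_ids` and the `corpus` wrapper element in the example are assumptions made for illustration, not TEES code.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_document_ids(root):
    """Return (number of documents lacking an id, sorted list of repeated ids)."""
    ids = [doc.get("id") for doc in root.iter("document")]
    missing = sum(1 for value in ids if value is None)
    counts = Counter(value for value in ids if value is not None)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


# Illustrative input: one document with only a pmid, and one id used twice.
corpus = ET.fromstring(
    '<corpus><document pmid="4988753" />'
    '<document id="d1" /><document id="d1" /></corpus>'
)
print(check_document_ids(corpus))  # (1, ['d1'])
```

A file is only safe to pass on when the result is `(0, [])`.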

Regards, Jari

ajjimeno commented 11 years ago

Hi Jari,

I have an example below from the EVEX data. I see that the interaction XML usually has more attributes, and I do not know exactly what they mean or how to generate them. Which ones are required by your program and which ones are not? Does the current version of TEES require more features than the ones you used previously for EVEX?

Best regards, Antonio

jbjorne commented 11 years ago

Hi Antonio,

Please find attached your original XML and a version updated for TEES 2.1. The most important things to note are that character offsets have been updated to the standard Java/Python format (+1 to end index) and that parses and tokenizations are placed in an "analyses" container element.

The BANNER-predicted entities must have the attributes given="True" and type="Protein" (for use with a GE model) and the document element must have an id.

Tokens and dependencies must have ids, unique in the scope of a single sentence. Dependencies refer to the tokens by ids, not by index as in the original file.

Finally, when running the program, remember to define the parse to be used with the -p switch. I can't test things at the moment, so I'm not 100% sure the attached XML is correct, but we'll see.

Regards, Jari

jbjorne commented 11 years ago

OK, so apparently attachments are not allowed. Let's try to put the XML here.

Original:

<document pmid="4988753">
  <sentence charOffset="0-83" id="4988753.s0" text="Some requisites in systems leading to hybrid formation between bacterial endotoxins.">
    <entity charOffset="63-82" headOffset="73-82" id="4988753.T1" text="bacterial endotoxins" />
    <tokenization tokenizer="split-McClosky">
      <token POS="DT" charOffset="0-3" text="Some" />
      <token POS="VBZ" charOffset="5-14" text="requisites" />
      <token POS="IN" charOffset="16-17" text="in" />
      <token POS="NNS" charOffset="19-25" text="systems" />
      <token POS="VBG" charOffset="27-33" text="leading" />
      <token POS="TO" charOffset="35-36" text="to" />
      <token POS="NN" charOffset="38-43" text="hybrid" />
      <token POS="NN" charOffset="45-53" text="formation" />
      <token POS="IN" charOffset="55-61" text="between" />
      <token POS="JJ" charOffset="63-71" text="bacterial" />
      <token POS="NNS" charOffset="73-82" text="endotoxins" />
      <token POS="." charOffset="83-83" text="." />
    </tokenization>
    <parse parser="McCloskyPenn" pennstring="(S1 (S (S (NP (DT Some)) (VP (VBZ requisites) (PP (IN in) (NP (NP (NNS systems)) (VP (VBG leading) (PP (TO to) (NP (NP (NN hybrid) (NN formation)) (PP (IN between) (NP (JJ bacterial) (NNS endotoxins)))))))))) (. .)))" tokenizer="McClosky" />
    <parse parser="split-McClosky" tokenizer="split-McClosky">
      <dependency dep="4" gov="3" type="partmod" />
      <dependency dep="7" gov="4" type="prep_to" />
      <dependency dep="9" gov="10" type="amod" />
      <dependency dep="10" gov="7" type="prep_between" />
      <dependency dep="3" gov="1" type="prep_in" />
      <dependency dep="0" gov="1" type="nsubj" />
      <dependency dep="6" gov="7" type="nn" />
    </parse>
  </sentence>
</document>

Updated:

<document id="4988753" pmid="4988753">
  <sentence charOffset="0-84" id="4988753.s0" text="Some requisites in systems leading to hybrid formation between bacterial endotoxins.">
    <entity given="True" type="Protein" charOffset="63-83" headOffset="73-83" id="4988753.T1" text="bacterial endotoxins" />
    <analyses>
      <tokenization tokenizer="split-McClosky">
        <token id="bt_0" POS="DT" charOffset="0-4" text="Some" />
        <token id="bt_1" POS="VBZ" charOffset="5-15" text="requisites" />
        <token id="bt_2" POS="IN" charOffset="16-18" text="in" />
        <token id="bt_3" POS="NNS" charOffset="19-26" text="systems" />
        <token id="bt_4" POS="VBG" charOffset="27-34" text="leading" />
        <token id="bt_5" POS="TO" charOffset="35-37" text="to" />
        <token id="bt_6" POS="NN" charOffset="38-44" text="hybrid" />
        <token id="bt_7" POS="NN" charOffset="45-54" text="formation" />
        <token id="bt_8" POS="IN" charOffset="55-62" text="between" />
        <token id="bt_9" POS="JJ" charOffset="63-72" text="bacterial" />
        <token id="bt_10" POS="NNS" charOffset="73-83" text="endotoxins" />
        <token id="bt_11" POS="." charOffset="83-84" text="." />
      </tokenization>
      <parse parser="split-McClosky" tokenizer="split-McClosky" pennstring="(S1 (S (S (NP (DT Some)) (VP (VBZ requisites) (PP (IN in) (NP (NP (NNS systems)) (VP (VBG leading) (PP (TO to) (NP (NP (NN hybrid) (NN formation)) (PP (IN between) (NP (JJ bacterial) (NNS endotoxins)))))))))) (. .)))">
        <dependency id="sd_0" t2="bt_4" t1="bt_3" type="partmod" />
        <dependency id="sd_1" t2="bt_7" t1="bt_4" type="prep_to" />
        <dependency id="sd_2" t2="bt_9" t1="bt_10" type="amod" />
        <dependency id="sd_3" t2="bt_10" t1="bt_7" type="prep_between" />
        <dependency id="sd_4" t2="bt_3" t1="bt_1" type="prep_in" />
        <dependency id="sd_5" t2="bt_0" t1="bt_1" type="nsubj" />
        <dependency id="sd_6" t2="bt_6" t1="bt_7" type="nn" />
      </parse>
    </analyses>
  </sentence>
</document>
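
The differences between the two listings above can be sketched as a small conversion script. This is an illustration derived only from comparing the two examples, not a tool shipped with TEES: the function names are hypothetical, the `bt_`/`sd_` id prefixes are copied from the updated example, every document is assumed to carry a `pmid`, and the Penn tree is assumed to live in a separate `parse` element that has no dependencies. Note that `gov` maps to `t1` and `dep` maps to `t2`, as in the example.

```python
import xml.etree.ElementTree as ET


def shift_end(offset):
    """Turn an inclusive-end offset into an exclusive-end one ('63-82' -> '63-83')."""
    begin, end = offset.split("-")
    return "%s-%d" % (begin, int(end) + 1)


def convert_document(doc, entity_type="Protein"):
    """Update one EVEX-style <document> in place. Not idempotent: run it once."""
    if doc.get("id") is None:
        doc.set("id", doc.get("pmid"))  # assumes a pmid attribute exists
    for sentence in doc.findall("sentence"):
        sentence.set("charOffset", shift_end(sentence.get("charOffset")))
        for entity in sentence.findall("entity"):
            entity.set("given", "True")
            entity.set("type", entity_type)
            for name in ("charOffset", "headOffset"):
                if entity.get(name) is not None:
                    entity.set(name, shift_end(entity.get(name)))
        tokenizations = sentence.findall("tokenization")
        parses = sentence.findall("parse")
        for tokenization in tokenizations:
            for i, token in enumerate(tokenization.findall("token")):
                token.set("id", "bt_%d" % i)
                token.set("charOffset", shift_end(token.get("charOffset")))
        # Fold the Penn tree of a dependency-free parse into the kept parses.
        pennstring, kept = None, []
        for parse in parses:
            if parse.get("pennstring") is not None and parse.find("dependency") is None:
                pennstring = parse.get("pennstring")
            else:
                kept.append(parse)
        for parse in kept:
            if pennstring is not None:
                parse.set("pennstring", pennstring)
            for i, dep in enumerate(parse.findall("dependency")):
                dep.set("id", "sd_%d" % i)
                # Token indices become token ids: gov -> t1, dep -> t2.
                dep.set("t1", "bt_" + dep.attrib.pop("gov"))
                dep.set("t2", "bt_" + dep.attrib.pop("dep"))
        # Move tokenizations and parses into a new <analyses> container.
        for element in tokenizations + parses:
            sentence.remove(element)
        analyses = ET.SubElement(sentence, "analyses")
        for element in tokenizations + kept:
            analyses.append(element)
    return doc
```

Calling `convert_document(ET.parse(path).getroot())` on a file whose root is a single `document` element, as in the original listing, would apply all of these changes. The result is still worth comparing by eye against the updated example.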

ajjimeno commented 11 years ago

Thank you. I managed to make it run! However, something still seems to be missing, since no a2 file with content is generated for some of the files I tested. I set the -p option to "split-McClosky".

python /usr/share/TEES/v2.1/classify.py -i file.ixml.xml -o file.ixml.xml.tees -m GE13 --omitSteps PREPROCESS -p split-McClosky

Thanks! Antonio

jbjorne commented 11 years ago

Nice to hear that it runs! Did you mean that in some cases no a2-file is generated, or that the a2-file is empty? If it's empty, that's OK if the system did not find anything in the text (in this case, there won't be interaction-elements in the output XML either).

Regards, Jari

ajjimeno commented 11 years ago

I managed to solve the problem. I think it was related to the parser. Even though I set the -p option to "split-McClosky", it did not seem to fill the a2 files. I replaced "split-McClosky" with McCC and it works now.

Thanks! Antonio

jbjorne commented 11 years ago

The name of the parse element to be used with the -p switch depends on which name the parse elements have in your input XML files (the name of the parse element is the value of its "parser" attribute).

Please keep in mind that if you use an incorrect name, TEES cannot detect this, as it is also possible for sentences to have no parse element at all (e.g. for cases of parser failure). If the parse is missing but the sentence has given entity-elements, TEES will still try to detect events, but with very low performance. So, please make sure you use the correct parse name, or it's quite likely your results will be nonsense.
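Since a wrong parse name goes unnoticed, it may help to inspect the input file before classification. The sketch below is a standalone helper and not part of TEES. It lists the `parser` attribute values found in the input and raises an error when the requested name never occurs; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parse_name_coverage(root):
    """Return (sentence count, Counter of sentences per 'parser' attribute value)."""
    names = Counter()
    total = 0
    for sentence in root.iter("sentence"):
        total += 1
        # A set, so a sentence holding two parses of one name counts once.
        names.update({parse.get("parser") for parse in sentence.iter("parse")})
    return total, names


def check_parse_name(root, parse_name):
    """Return (sentences with this parse, all sentences); raise if the name is absent."""
    total, names = parse_name_coverage(root)
    if names[parse_name] == 0:
        found = sorted(name for name in names if name is not None)
        raise ValueError("no parse named %r; names in file: %s" % (parse_name, found))
    return names[parse_name], total
```

For example, `check_parse_name(ET.parse("file.ixml.xml").getroot(), "McCC")` returns how many sentences carry that parse, which also shows how many sentences would be classified without one.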

Regards, Jari

ajjimeno commented 11 years ago

Hi,

I checked, and the output makes sense for the files I have looked at so far.

Thanks! Antonio
