Adding new labels Issue

kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

https://grobid.readthedocs.io

Apache License 2.0

3.46k stars 446 forks source link

Adding new labels Issue #344

Open aishwaryabh opened 6 years ago

aishwaryabh commented 6 years ago

Hi - I have the following questions regarding Grobid:

1) My current task is to add new labels to sections of the pdf (ie: features of product, warranty, expiration date, etc). Basically, I am manually going through both the training and evaluation documents, and I am adding in my own labels such as'<features>' to the appropriate section. I was successfully able to add new labels to my environment by editing the startElement() and writeData() methods in grobid-trainer/src/main/java/org/grobid/trainer/sax/TEIFulltextSaxParser.java, and also by adding labels through /org/grobid/core/engines/label/TaggingLabels.java.

I am using batch testing to exeucute this. Here is my process: first I input injava -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-SNAPSHOT-onejar.jar -dIn ./trainingdata/test_in -dOut ./trainingdata/test_out2 -exe createTraining to create the training data. I then manually go through the fulltext.tei.xml documents to add new labels like I explained above to both training and evaluation., and I move the appropriate fulltext documents to the corpus and evaluation folder in /grobid-trainer/resources/dataset/fulltext. I then train the data by executing ./gradlew train_fulltext, I see that the labels that I manually inputted into the evaluation folder show up in the evaluation results.

However, my issue is that when I access the live grobid web api server (http://myhostname:8085), and I input in a pdf to processFullText through the GUI. I see that my labels that I manually added do not show up at all. Only the regular labels show up like paragraph and header. Also the format of this xml file is different from the myfile.fulltext.tei.xml files that I processed initially(ie: tables and images do not show up). How do I fix this issue that I am having? Am I testing the model correctly? If not, how would you recommend testing it?

2) What is the difference between token and field level results? I understand a field is a compilation of tokens, but what exactly is a token in this case?

Thank you so much for your help!

kermitt2 commented 6 years ago

Hi @aishwaryabh !

I think you did the hardest, you still need to indicate how you want to serialize the new labels in the final TEI result.

Look at the class org.grobid.core.document.TEIFormatter, in the method toTEITextPiece(), there is an iteration on all TaggingTokenCluster which are sequence of LayoutToken with the same label. You need to add in the conditions one corresponding to your new label, and create the corresponding XML element to encode it, for example:

        else if (clusterLabel.equals(TaggingLabels.MY_NEW_LABEL)) {
                String clusterContent = LayoutTokensUtil.normalizeDehyphenizeText(cluster.concatTokens());
                    Element newElement = teiElement("rs", clusterContent);
                    newElement.addAttribute(new Attribute("type", "new_label"));
                    curDiv.appendChild(newElement);
            }

and you should get outputed in the final resulting TEI:

     bla bla bla <rs type="new_label">text tagged</rs> bla bla bla

(note: in this example there is dehyphenization of the annotated text, you might want to keep the text untouched depending on the nature of the information you want to annotate and remove the dehyphenization)

Table and figures are positioned at the end of the final TEI. The final resulting TEI is a logical representation of the input document, independent from a particular presentation.

Token is word level tokenization, with puntuations as tokens - it is called token because it does not correspond to words as a linguist would define it (tokenization here is not linguistic motivated, it's just to achieve our NLP tasks). The class LayoutToken is a representation of a token with all its PDF layout informations (coordinates, font, size, etc.)

Field represents a complete sequence with the same label, so an annotated entity, comprising several words, so several tokens.

Hope this is helpful !

aishwaryabh commented 6 years ago

Thank you so much @kermitt2 ! The labeling worked! I am fairly new to machine learning in practice so I have an other quick question:

Grobid initially had many files in the corpus for training, so I deleted them and replaced them with my own pdfs. However, I only had about 8 pdf's in the training to generate the model (that I manually went through and put in my own labels like "features"), and now when I try to generate new XML files, the file skips certain paragraphs and tables. However, when I run the data with the documents you already provided in the corpus folder along with my 8 pdfs, I get a better result in terms of portions of the document not being skipped.

But, the documents that are already in the Grobid corpus file do not have labels parsed that I created. So my question is that would you recommend training with your initial pdfs or removing them?

kermitt2 commented 6 years ago

You added some new labels to the full text model, but if I have understood well, you did not removed the existing labels? So the model will still try to identify paragraphs, section headers, etc.

My guess is that by removing the previous existing examples, you make much harder for grobid to identify paragraphs, figures, table, etc. because the tool has less examples.

One option would be to add your own labels to the 20 existing training documents (even if they are not present in these documents, it's also good, because it will be reliable negative examples), and add your new 8 examples documents with the correct labels for paragraphs, figure, tables, etc.

The drawback of adding new labels to the full text model is that you need to cover all the labels (old + new) in all the training examples. As an alternative, you could also have created a new model with only your new labels, and apply this new model to some sections identified by the full text model, in cascade. Then the training data for your new labels will be independent from the training data for the full text model.

This approach of creating a new model is taken by some GROBID modules for identifying astronomical entities, software mentions or physical quantities.

aishwaryabh commented 6 years ago

Hi @kermitt2, Ok that makes sense! The problem with what I am trying to do is that I also want to identify tables, figures, and other entities that I label (like warranty, orderingInfo, etc). Therefore, I don't think I should make my own labeler. So I am currently taking the approach in which I label each of the documents with the appropriate labels, and I assume that grobid then learns the appropriate categories using the fulltext model. I am however having the following problems:

1) So I have a table that I want to label as "specifications" because it describes the specifications of the product. So inside the "table" label, I also include specifications, but this messes up the format. When I test it, the text becomes all jumbled up and is spread out through the bottom of the document. How would you recommend classifying tables for my situation?

2) I see grobid picks up on the text in the middle of document. However if there is some kind of "overview" or abstract in the beginning or conclusion at the end of the document, processing the fulltext document does not pick up on it. I understand there are other models like segmentation, so would I have to train those as well in order for fulltext to pick up on the missing components? From my understanding, fulltext seems to load all the models from my server, even though I have only modified fulltext.

3) Does Grobid label the text based on key words or placement? For example, I have a label named "features." The key word "features" shows up in the text that is supposed to be labeled as "features" quite often, so is grobid using the frequency of words to label? Or am I mis-understanding something?

Again, thank you so much for your help and your prompt responses!

kermitt2 commented 6 years ago

Hi @aishwaryabh

Using a new model would actually solve these issues 1. and 2. because you could then call it from any structures where it is relevant, for instance for the abstract, the content of tables, the content of paragraph, etc. and not from non-relevant structures (for instance from formula structure or from the bibliographical section).

If you link the new labels to the full text model only, the labels will be present only for the body of the article, not the header/annexes/etc. and not in substructures of the body like the table. So introducing a new model is preferable.

See the following example in grobid-astro, method processPDF We have a new model for recognizing astronomical entities, only one label, quite simple. In the method above, the astro. entity extraction is applied to some "zones" identified by their label name, so to the LayoutToken sequence of these zones. We select like this the relevant structures where such entities could appear. We could imagine passing new features to the classifier specific to the structure where it is called, or, for example, even train a classifier specific to table content.

The drawback is that you need to reinject yourself these new annotations in the final TEI, if you want a fully structured TEI.

GROBID is 100% machine learning, there is no predefined keyword to mark the structures. The machine learning model, CRF, is learning which kind of wording is starting or closing a structure, given the content around, the layout, the spacing, etc. So the frequency of the words labeled in the training data will be used.

aishwaryabh commented 6 years ago

Hey @kermitt2, Ok sounds good! Would you mind giving me some guidance as to how to create my own model? I am very new to all of this. From what I know, I would have to create a new directory in the grobid folder and include the folders that are in the astro project. Then I would have to modify the parser (so AstroParser in this example) and the TaggingLabels. I am confused as to how I would run this new model though? And for the training data, can I use the same pieces of data that I already inputted into the fulltext model? Sorry for all of the questions; I just really want to make sure I have a clear understanding before modifying what I already have. Thanks!

kermitt2 commented 6 years ago

Yes this is the process. Concretely, I suggest to use all the grobid-astro repo as a template for your grobid sub-module:

if your model if called for instance the feature model (or product model maybe, I keep feature as an example here), replace basically all astro strings with feature (you will have a grobid-feature subdirectory, a FeatureParser class, a FeatureEntity class, etc.)
so rename everything that has astro in it with feature, including the paths and files under grobid-features/resources, and remove the astro training files
introduce your new labels in org.grobid.core.engines.labels.FeatureTaggingLabels
in the grobid core class org.grobid.core.GrobidModels (so the normal main grobid folder), declare your new model FEATURE("feature"),
put your training files under grobid-feature/resources/dataset/feature/corpus, but now they only need to contain your new labels (remove the full text tags). You don't need to put a complete document, just the paragraphs containing your new labels, and some paragraph without annotations to balance positive and negative annotations so that your precision/recall balance is what you expect
modify FeatureEntity to represent what kind of entity you are extracting (so for instance for a Product entity, adding fields for warranty, expiration date, ...)
modify the method extractFeatureEntities in FeatureParser.java to build the FeatureEntity object according to the labeled fields
there's a lexicon of common astronomical names used by the class AstroLexicon, you don't need it of course but you can replace this lexicon with vocabulary that you consider useful for your domain, it will be used as features in the machine learning model to improve accuracy/generalisation
on the training part, modify the class org.grobid.trainer.FeatureAnnotationSaxHandler so that you map the XML tag/attributes you use in your TEI training files to your FeatureTaggingLabels

Normally at this stage you will have a complete standalone tagger for your task with all the functionalities (batch, web REST service, Java API) usable, able to process text and PDF.

aishwaryabh commented 6 years ago

Hi @kermitt2 , Thank you so much for your detailed response! It really helped me understand how to make my own model. So I made my model and I was trying to use maven to train the documents using: mvn generate-resources -Ptrain_astro. But then I get the error with missing dependencies: Would I have to move those files into the grobid/grobid-astro/lib directory? So then I instead tried training the files through batch testing, but I kept getting an error that there is no such model as astro in TrainerRunner.java. Am I implementing the model correctly? Again thank you so much for your help; I really appreciate it!

kermitt2 commented 6 years ago

Hello ! I forgot a couple of things, sorry:

you need to update similarly the maven file pom.xml under your new sub-project by also replacing all occurrences of the string astro with your project name string, so for instance feature.
You will then have a maven goal train_feature that you will be able to call similarly:

mvn generate-resources -Ptrain_features

It should create the lib and war packages for your new project (target/grobid-features-0.5.1-SNAPSHOT.war, etc.)

I also forget to mention it, for using the web services and demo web console, you need also to update some files under src/main/webapp/, in particular src/main/webapp/WEB-INF/web.xml with the new name of the service and src/main/webapp/grobid/grobid-astro.js if you only need the batch mode, you don't need to modify this webapp sub-directory

aishwaryabh commented 6 years ago

I'm still getting the same error as above for the dependencies. Is this because I am using the developmental version of grobid (0.6.0-SNAPSHOT)?

kermitt2 commented 6 years ago

I see, you will need to update the current development version of GROBID, it should be 0.5.2-SNAPSHOT. The version of GROBID indicated in the maven pom.xml file of your new sub-project must be the same as the one of the GROBID version you are using.

aishwaryabh commented 6 years ago

Hi @kermitt2, I updated my development version of grobid, but I still cannot seem to figure out why the build is failing. This is my first time using maven, so I am not too familiar with it!

So I am getting the following error message when simply running mvn installin my grobid/grobid-astro directory:

[ERROR] Failed to execute goal on project grobid-astro: Could not resolve dependencies for project org.grobid:grobid-astro:war:0.5.2-SNAPSHOT: Failure to find fr.limsi.wapiti:wapiti:jar:1.5.0 in file:////opt/grobid3/grobid/grobid-astro/lib/ was cached in the local repository, resolution will not be reattempted until the update interval of 3rd-party-local-repo has elapsed or updates are forced -> [Help 1]

Also while building I get the following warning about a missing POM file:

[INFO] ------------------------------------------------------------------------
[INFO] Building grobid-astro 0.5.2-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for fr.limsi.wapiti:wapiti:jar:1.5.0 is missing, no dependency information available

Lastly, where exactly can I find build.xml? Once again, thank you so much for your help! :)

kermitt2 commented 6 years ago

Sorry it's the pom.xml file under grobid/grobid-astro/pom.xml (not the build.xml, I'll correct my comment above too).

Normally if you build grobid, then grobid-astro, the dependencies should work fine. Be sure to have a maven version up to date.

aishwaryabh commented 6 years ago

Hi, I believe I already have the most updated version of maven. Also, I also think I correctly built grobid, but I am not able to build grobid-astro. Here are some details:

This is my current version of maven:

And when I run sudo apt-get --only-upgrade install maven I get the following:

I also believe that grobid is built correctly, as when I run ./gradlew run, grobid is successfully run.

So when I run 'mvn clean install -U' under grobid/grobid-astro, I get the following error:

Downloading: file:////opt/grobid3/grobid/grobid-astro/lib/fr/limsi/wapiti/wapiti/1.5.0/wapiti-1.5.0.pom Downloading: http://download.java.net/maven/2/fr/limsi/wapiti/wapiti/1.5.0/wapiti-1.5.0.pom Downloading: https://dl.bintray.com/rookies/maven/fr/limsi/wapiti/wapiti/1.5.0/wapiti-1.5.0.pom Downloading: https://repo.maven.apache.org/maven2/fr/limsi/wapiti/wapiti/1.5.0/wapiti-1.5.0.pom [WARNING] The POM for fr.limsi.wapiti:wapiti:jar:1.5.0 is missing, no dependency information available

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.426 s
[INFO] Finished at: 2018-09-12T14:36:33-04:00
[INFO] Final Memory: 13M/366M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project grobid-astro: Could not resolve dependencies for project org.grobid:grobid-astro:war:0.5.2-SNAPSHOT: Could not find artifact fr.limsi.wapiti:wapiti:jar:1.5.0 in 3rd-party-local-repo (file:////opt/grobid3/grobid/grobid-astro/lib/)

Am I supposed to be adding something to the grobid/grobid-astro/lib directory? I think I'm missing something but I'm not able to put my finger on it? Thanks :)

kermitt2 commented 6 years ago

grobid-astro expects the wapiti lib to be under ~/.m2/repository/fr/limsi/wapiti/wapiti/1.5.0/but it's not not copied there where building grobid, I see now the problem. I push a fix for grobid-astro for having the localLibs under grobid/grobid-astro/lib/, however it's not very satisfactory.
But if you update grobid-astro, it should work I think now.

aishwaryabh commented 6 years ago

Thanks @kermitt2! The grobid-astro folder now builds, but the test now fails. I think this is due to it not finding grobid-home? However I see that grobid-home is defined in pom.xml under plugins. Here is my output:

`-------------------------------------------------------
 T E S T S
-------------------------------------------------------

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.grobid.core.engines.AstroParserTest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-jdk14/1.7                                                                             .25/slf4j-jdk14-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1                                                                             .7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
Sep 13, 2018 1:05:10 AM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathO                                                                             rLoadFromClasspath
WARNING: No Grobid property was provided. Attempting to find Grobid home in the                                                                              current directory...
Sep 13, 2018 1:05:10 AM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathO                                                                             rLoadFromClasspath
WARNING: Attempting to find and in the classpath...
GROBID astro initialisation failed: org.grobid.core.exceptions.GrobidPropertyExc                                                                             eption: [GENERAL] No Grobid home was found in classpath and no Grobid home locat                                                                             ion was not provided
org.grobid.core.exceptions.GrobidPropertyException: [GENERAL] No Grobid home was                                                                              found in classpath and no Grobid home location was not provided
        at org.grobid.core.main.GrobidHomeFinder.fail(GrobidHomeFinder.java:92)
        at org.grobid.core.main.GrobidHomeFinder.getGrobidHomePathOrLoadFromClas                                                                             spath(GrobidHomeFinder.java:128)
        at org.grobid.core.main.GrobidHomeFinder.findGrobidHomeOrFail(GrobidHome                                                                             Finder.java:54)
        at org.grobid.core.utilities.GrobidProperties.getInstance(GrobidProperti                                                                             es.java:102)
        at org.grobid.core.engines.AstroParserTest.setUpClass(AstroParserTest.ja                                                                             va:36)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.                                                                             java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces                                                                             sorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(Framework                                                                             Method.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCal                                                                             lable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMe                                                                             thod.java:44)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.                                                                             java:24)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provide                                                                             r.java:252)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4                                                                             Provider.java:141)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider                                                                             .java:112)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.                                                                             java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces                                                                             sorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(                                                                             ReflectionUtils.java:189)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke                                                                             (ProviderFactory.java:165)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(Provi                                                                             derFactory.java:85)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(Fork                                                                             edBooter.java:115)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:                                                                             75)
Sep 13, 2018 1:05:10 AM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathO                                                                             rLoadFromClasspath
WARNING: No Grobid property was provided. Attempting to find Grobid home in the                                                                              current directory...
Sep 13, 2018 1:05:10 AM org.grobid.core.main.GrobidHomeFinder getGrobidHomePathO                                                                             rLoadFromClasspath
WARNING: Attempting to find and in the classpath...
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.109 sec <<< FA                                                                             ILURE!
org.grobid.core.engines.AstroParserTest  Time elapsed: 0.108 sec  <<< ERROR!
org.grobid.core.exceptions.GrobidPropertyException: [GENERAL] No Grobid home was                                                                              found in classpath and no Grobid home location was not provided
        at org.grobid.core.main.GrobidHomeFinder.fail(GrobidHomeFinder.java:92)
        at org.grobid.core.main.GrobidHomeFinder.getGrobidHomePathOrLoadFromClas   `

I have a quick question about the training files. So I already have over 20 TEI files that have been manually hand-annotated to the grobid fulltext model.

Here is an example of the text:

<?xml version="1.0" ?>
<tei>
    <teiHeader>
        <fileDesc xml:id="0"/>
    </teiHeader>
    <text xml:lang="en">
            <productDesc>Data centers demand high performance<lb/> networking solutions....</productDesc>

Can I use these same files? Or would I have to modify them them to have tags like <rs type="productDesc">?

Finally, I was trying to create training data and create TEI files using the batch command, but I realized that there is no grobid-astro-0.5.2-SNAPSHOT.one-jar.jar. How would I do this? Thanks for your help!

kermitt2 commented 6 years ago

Regarding the test error, apparently grobid-astro does not find the grobid-home path (path to grobid/grobid-home). By default it is ../grobid-home/ from where you launch the test, but you can also set the path in grobid-astro/src/main/resources/grobid-astro.properties.

The test will anyway not run because there is no model by default (see the readme of grobid-astro), you will need to train one first. You can by-pass the test with the maven option -DskipTests. You will have a grobid-astro-0.5.2-SNAPSHOT.one-jar.jar file by skipping the tests. However, if you want to create training data, you need a first model to be trained to bootstrap the process.

Regarding the training data, what is expected is something similar to the training data of grobid-astro, so what is under grobid-astro/resources/dataset/astro/corpus/ so basically only paragraphs. The way you are encoding these files need to be consistent with the XML parser which is used and that you can modify accordingly (class org.grobid.trainer.AstroAnnotationSaxHandler).

aishwaryabh commented 6 years ago

Ok that makes sense! So just to clarify, since the astronomical model identifies only paragraphs right now, I wanted to process the entire document (not just paragraphs) (ie: I want all images, tables, headers, text, etc). Currently, when I process the PDFs, the output is missing some headers and portions of the document. From my understanding, I would have to modify AstroParser's processPdf method, but I am not sure where to begin from. Any advice would be much appreciated! :)

kermitt2 commented 6 years ago

The astronomical model identifies astronomical entities in text, any text (the fact that the text is stored in <p> in the training data is just for convenience). So it is independent from the place the text occurs - it could be the text of the title, of the abstract, of a paragraph, of a table caption, etc.

In processPDF method, GROBID is first applied to structure the text, then we apply grobid astronomical model to each structure containing text we want to process - the structures (or zones) are identified by their label name. The text of each "zone" is in practice a LayoutToken sequence (a list of textual token augmented with layout information) that we get from the zone.

aishwaryabh commented 6 years ago

Oh I see. Hypothetically speaking, instead of creating a new model, could I just modify the full text model? I still want my model to identify and explicitly mark whether something is the title of the text, abstract, figures, or tables. Even if I were to create a new model, I understand that I would have to train it from scratch to identify these components as abstract, figure, table, etc. However, fulltext model already seems to do a decent job at doing that. My only issue with the fulltext model was with all of the text not showing up (ie: the abstract and some tables)

For instance, I generated all of the TEI training modules using batch testing for a file called example.pdf. I have a table toward the end of example.pdf, but that was skipped by example.training.fulltext.tei.xml. However, I found the table at example.training.references.referenceSegmenter.tei.xml. How would I modify FullTextParser.java in order to also account for this table that showed up elsewhere?

kermitt2 commented 6 years ago

The principle of GROBID is to apply the models in cascade. We apply first the segmentation model to have the main document zones (header, body, bibligraphical section, etc.), then we apply the header model on the header part, the full text model to the body part (full text model should have been named document body model, but it was a bit long :) ), table model is apply to table area found by the full text model, and so on. This is very useful and efficient for exploiting and balancing training data, because each model can use training data for its scope.

Here is roughly the cascading hierarchy (Table and Figure models would need to be added under Fulltext model):

screenshot from 2018-09-14 04-47-25

If you simply add labels to the full text model, you will limit these labels to the scope of the fulltext model which the document body (not the header), and you won't be able to further annotate structures that are parsed after the application of the full text model (table and figure in particular).

By having an independent model, you can apply this model to any structures found by grobid in a first step (including what has been found the fulltext model). There is no need to identify any structures already identified by grobid, you can just focus on your new model and new labels.

Regarding your last question, if a table appears in the referenceSegmenter training data, it means it's an error of the above model, the segmentation model (Segmenter in the diagram) which has erroneously classify the zone as bibliographic section instead of body section.

aishwaryabh commented 6 years ago

Ok got it! So I basically have to choose which models I want to use by creating a brand new model with my own personal labels. I also have a quick question about tables in the fulltext model in the picture below. It is successfully able to recognize the component as a table, but I am not able to see the contents in it. So I can correctly assume then that the actual content of the table is in another separate model (like segmenter for the above example). Is my assumption right?

AravindSanga commented 5 years ago

how to pass single reference and process with grobid

kermitt2 commented 5 years ago

Not the right place to ask, but you can have a look here: https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocesscitation and here https://grobid.readthedocs.io/en/latest/Grobid-batch/#processrawreference

(if I understand correctly the question)

jribault commented 5 years ago

Hi,

Thanks for the tutorial, but I still can't make it work... My objective is to build a MyPatent module to identify patent number. I know there is already something in grobid to handle patent, but I'm trying to learn :) Everything compile, but training doesn't work. I'm getting a wapiti error :

warning: missing tokens, cannot apply pattern error: no train data loaded

Do you have any idea where this can coming from ?

Best regards,

jribault commented 5 years ago

Just in case other people are interested, I'm starting all over again and write here what I'm doing.

download grobid 0.5.4 and untar inside grobid dir : ./gradlew clean install

download grobid-astro and put it as a child of grobid-0.5.4 inside grobid/grobid-astro :

edit build.gradle and update grobid version dependency to 0.5.4 compile 'org.grobid:grobid-core:0.5.4' compile 'org.grobid:grobid-trainer:0.5.4'
in AstroParser.java change --Pair<String, LayoutTokenization> featSeg to org.apache.commons.lang3.tuple.Pair<String, LayoutTokenization> featSeg -- I also commented what was concerning the HEADER and the ANNEX as I'm only interested by the BODY then ./gradlew clean install

I moved sample.tei.xml from evaluation to corpus and run ./gradlew train_astro

then java -Xmx4G -jar build/libs/grobid-astro-0.5.1-SNAPSHOT-onejar.jar -gH ../grobid-home -dIn '/GrobidData/input' -dOut ~/test_data/out/ -exe createTraining

So the complete chain is working for the astro module (I have to try to rename everything again), but I don't have all the body text in the tei.xml file created. I already retrained the segmentation model with corrected TEI from my pdfs so I know the pdf i'm working on are 100% correctly segmented.

Also I really don't know what is the file crfpp-template and how to modify it according to my need.

LeelaMani commented 5 years ago

Hi My task is to retrain header part alone I have started training. Have a doubt if i am going in the right path. By retraining only header part will my overall score be reduced Please reply