kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Affiliation Processing: Training required? #794

Open dominic01 opened 3 years ago

dominic01 commented 3 years ago

I was trying to process this affiliation: Graduate University of Sciences and Technology, Vietnam Academy of Science and Technology, Hanoi, Vietnam

Using ProcessAffiliation in https://cloud.science-miner.com/grobid/

<affiliation>
    <orgName type="institution">Graduate University of Sciences and Technology</orgName>
    <address>
        <country key="VN">Vietnam</country>
    </address>
</affiliation>
<affiliation>
    <orgName type="institution">Academy of Science and Technology</orgName>
    <address>
        <settlement>Hanoi</settlement>
        <country key="VN">Vietnam</country>
    </address>
</affiliation>

Do I need to train the affiliation model? About 90% of the affiliations would have a country or city name in them.
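For anyone reproducing this outside the web console, here is a minimal sketch of how the same call could be prepared and how the returned fragments could be inspected. It assumes a GROBID service at `http://localhost:8070` (a hypothetical local instance, not the demo server above), the `/api/processAffiliations` endpoint with an `affiliations` form field, and a response in the un-namespaced shape shown in this post. The helper names `build_affiliation_request` and `parse_affiliations` are illustrative only and are not part of GROBID.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: a GROBID service is reachable here (hypothetical local instance).
GROBID_BASE = "http://localhost:8070"

# The output quoted in this post, used as sample data.
SAMPLE = """
<affiliation>
    <orgName type="institution">Graduate University of Sciences and Technology</orgName>
    <address>
        <country key="VN">Vietnam</country>
    </address>
</affiliation>
<affiliation>
    <orgName type="institution">Academy of Science and Technology</orgName>
    <address>
        <settlement>Hanoi</settlement>
        <country key="VN">Vietnam</country>
    </address>
</affiliation>
"""


def build_affiliation_request(raw, base=GROBID_BASE):
    """Build (but do not send) a form-encoded POST for one raw affiliation string."""
    data = urllib.parse.urlencode({"affiliations": raw}).encode("utf-8")
    return urllib.request.Request(
        base + "/api/processAffiliations",
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_affiliations(xml_fragment):
    """Turn a sequence of <affiliation> elements into plain dicts."""
    # The response is a list of sibling elements, so wrap it in a single root.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    results = []
    for aff in root.iter("affiliation"):
        settlement = aff.find("address/settlement")
        country = aff.find("address/country")
        results.append({
            "institutions": [
                org.text for org in aff.findall("orgName")
                if org.get("type") == "institution"
            ],
            "settlement": settlement.text if settlement is not None else None,
            "country": country.text if country is not None else None,
            "country_key": country.get("key") if country is not None else None,
        })
    return results


if __name__ == "__main__":
    for entry in parse_affiliations(SAMPLE):
        print(entry)
```

To send the request, pass the object returned by `build_affiliation_request` to `urllib.request.urlopen` and feed the decoded body to `parse_affiliations`. Comparing the parsed institutions against the input string makes a segmentation problem easy to spot, such as the single input here coming back as two affiliations with "Vietnam" missing from the second institution name.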

kermitt2 commented 3 years ago

Hello @dominic01

Yes, if a particular example fails, the idea is to add it to the training data and retrain the model.

About the second example, it's quite rare for an affiliation-address block to come without line breaks and commas. The model is trained on actual affiliation-address blocks as extracted from PDFs.

dominic01 commented 3 years ago

Got it. The tags gave the wrong impression; I have corrected my post.

Some more examples:

Biotechnology Research and Development Institute, Can Tho University, Can Tho City, Vietnam.

Institute of Tropical Biology, Vietnam Academy of Science and Technology, Hanoi, Vietnam.