inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Add more training data #109

Closed sarankup closed 6 years ago

sarankup commented 6 years ago

Hi,

I am creating a large set of entities for our dataset, for example more publisher names, journal names, etc. I have a couple of questions. Please advise how to achieve this.

[1] How to add more records to the existing entities?
[2] How to add an additional category, for example, legislation?

Thanks, in advance.

inukshuk commented 6 years ago

Representative training data is always a good way to improve results; in the case of publisher and journal names you could also add them to the dictionary (to add them to the default dictionary you might consider adding them to anystyle-data here).

And to answer your questions:

  1. You mean how to train a model using your own data set? First, create an XML data set like those you can find here. You can use the Wapiti::Dataset class to load and work with such training sets using Ruby's default comparison and set operator methods. Given such a training set, you can train a parser model using the parser's #train method (you can pass your training set as a parameter). After training is complete, you have to save the model using #model.save. You can see an example using the default parser instance in the rake train task here

  2. If you add new labels (like 'legislation') to your training set and train a model with it, the parser will automatically start using the label. Note however, that default normalizers will not do anything with such segments (because they are not aware of them). You can add your own normalizers or just post-process the data after parsing.
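To make point 2 a bit more concrete, here is a minimal sketch that writes a one-reference training file containing an extra legislation label, using only plain Ruby. It assumes the dataset/sequence layout of the XML training sets mentioned above; the sample reference, its segmentation, the helper names and the file name are all made up for illustration.

```ruby
require 'tmpdir'

# Escape the characters that are special in XML text nodes.
def xml_escape(text)
  text.gsub(/[&<>]/, '&' => '&amp;', '<' => '&lt;', '>' => '&gt;')
end

# Turn [[label, text], ...] pairs into one <sequence> element.
def sequence_xml(segments)
  body = segments.map do |label, text|
    "    <#{label}>#{xml_escape(text)}</#{label}>"
  end
  (['  <sequence>'] + body + ['  </sequence>']).join("\n")
end

# Hypothetical, hand-labelled reference; 'legislation' is the new label.
reference = [
  ['legislation', 'Example Standards Act,'],
  ['date', '2001,'],
  ['pages', 'pp. 12-14.']
]

xml = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<dataset>',
  sequence_xml(reference),
  '</dataset>'
].join("\n") + "\n"

# Illustrative output location; any path would do.
path = File.join(Dir.tmpdir, 'legislation-sample.xml')
File.write(path, xml)
puts xml
```

A file like this, extended with many more consistently labelled references, is the kind of training set you would then load and pass to the parser's #train method as described in point 1.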

a-fent commented 6 years ago

Just to expand on (2) - to add a new type of item, for example "legislation", you would need to alter the type normaliser in lib/normalizers/type.rb. This classifies each record according to the presence or absence of various fields (e.g. a "journal article" must have a "journal"), and according to the content of those fields.

But what you want to do is very much possible - I have had good results identifying and classifying citations of things like government regulations and standards by adding records to the training set, as @inukshuk suggests, and then tweaking the type classifier. See this question: https://github.com/inukshuk/anystyle/issues/106. I will also try to update my fork on GitHub to my current code, which includes this.

silviaegt commented 5 years ago

Hi @inukshuk! I love anystyle and would love to know how to contribute (also I plan to use your development for my PhD thesis!). I am not a coder but I more or less understand your code and have some ideas that might be useful:

  1. Amanda Whitmire has been using CrossRef Simple Text Query to retrieve an article's DOI which she later parses with https://doi2bib.org/ -- and I feel like it might be possible to create a win-win collaboration with CrossRef
  2. Have you tried adding NER to your process? I can see you have a great list of English names but NER might be more robust for multilingual references?
  3. Also, was wondering if Wikidata might be useful to enrich your training data? This query, for instance, retrieves 45,654 journal titles with their description: https://w.wiki/5$J

inukshuk commented 5 years ago

Absolutely, if you know that your data is available in a database such as CrossRef it is a very good approach to try to find the references there based on your input data. Using this approach you can either forgo parsing completely (if I'm not mistaken that's the approach described by the tweet) or you could combine the two: parse the input with something like AnyStyle and then use the parse result to query databases for an authoritative or canonical version of the reference.
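A rough sketch of the combined approach: take the fields of one parsed reference and turn them into a lookup URL. It assumes the Crossref REST API's works endpoint and its query.bibliographic parameter; the parsed hash is a hand-written stand-in (its field names are assumptions about the parse result), the helper names are hypothetical, and no request is actually sent.

```ruby
require 'uri'

# Hand-written stand-in for one parsed reference (made-up data).
parsed = {
  author: [{ family: 'Smith', given: 'J.' }],
  title: ['Title of the Study'],
  date: ['2020'],
  'container-title': ['Journal of Advanced Research']
}

# Join the most distinctive fields into one free-text query string.
def bibliographic_query(ref)
  authors = Array(ref[:author]).map { |a| a[:family] }
  parts = authors + Array(ref[:title]) +
          Array(ref[:'container-title']) + Array(ref[:date])
  parts.compact.join(' ')
end

# Build (but do not send) a works query asking for the single best match.
def crossref_url(ref)
  params = { 'query.bibliographic' => bibliographic_query(ref), 'rows' => 1 }
  "https://api.crossref.org/works?#{URI.encode_www_form(params)}"
end

puts crossref_url(parsed)
```

The response would still have to be compared against the parsed fields before being trusted as the canonical version, since a free-text query always returns some best match.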

Adding NER to the process might be interesting. Currently AnyStyle's 'dictionary' features are based on data in a separate gem and based on single word look-ups. I have no data to back this up, but I doubt that you can improve the results dramatically by having better NER in the feature extraction phase: after all you still have the fundamental issue that named entities can occur all over the place, just recognizing the name does not help identify the right segment (is it part of the author, editor, title, publisher, etc.?).

However, NER might also be useful in the normalizer phase (which happens after parsing), to help get canonical names for things like journals.

inukshuk commented 5 years ago

I should add that I would recommend using the CLI version of anystyle: it is much improved over the version hosted on anystyle.io. What's more, the server currently hosting anystyle.io is going to be shut down in about a month and the web service will probably stay offline until I, or someone else, finds the time to update it to anystyle 1.0.

silviaegt commented 5 years ago

Hey @inukshuk, thanks for the quick reply, not sure why I didn't get any notifications! Sorry 'bout that. I did see your gem's data, and that is part of the reason why I suggested Wikidata in order to enrich your "journals_XX.txt" lists. Did you take a look at the 45,654 journal titles query: https://w.wiki/5$J? It might also be useful to improve the "surnames_XXX.txt" lists. For instance, this query (which can be downloaded as a CSV as well) shows the 20,000 most common surnames in Wikidata: https://w.wiki/68c

inukshuk commented 5 years ago

I'm happy to add any useful word list to the data gem. The main issue with journal titles, in my experience, is the oftentimes cryptic abbreviations, so I think a mapping of all the variations of journal names to a canonical one would be more useful than more dictionary data (that would make for a great normalizer). That said, adding more journal names or surnames would be great. The format used is a text file with one word per line and an #! <tag> instruction at the top, where #! name would add the words to the name dictionary and #! journal to the journals dictionary. The ingestion script converts everything to lower case, removes diacritics and normalizes to the Unicode mode used internally by the parser, so there's no need to pre-process the text input, except for the word segmentation.
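For illustration, a list in that format could be produced from a handful of titles along these lines. This is only a sketch: the titles, the helper name and the output file name are made up, and splitting on whitespace is just the simplest possible word segmentation.

```ruby
require 'tmpdir'

# Build the text of a word list: a '#! <tag>' line, then one word per line.
def word_list(tag, phrases)
  words = phrases.flat_map(&:split).uniq
  (["#! #{tag}"] + words).join("\n") + "\n"
end

# Made-up sample titles; a real list might come from a Wikidata export.
titles = ['Revista de Historia', 'Journal of Advanced Research']

# Illustrative file name and location.
path = File.join(Dir.tmpdir, 'journals_sample.txt')
File.write(path, word_list('journal', titles))
puts File.read(path)
```

Case and diacritics are left untouched here on purpose, since the ingestion script takes care of those.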

larawehbe commented 4 months ago

Hello, how should I use anystyle-data with the anystyle parser?

require 'anystyle'

# Set Anystyle to use an in-memory dictionary
AnyStyle.options[:dictionary_adapter] = :ruby

# Example usage of the parser
reference = "Smith, J. (2020). Title of the Study. Journal of Advanced Research, 34(2), 123-130."
result = AnyStyle.parse reference

puts result.inspect

Say I have this code and want to add my own dictionary list to it so as to catch a specific keyword, say pps to refer to pages. How can I do this? Sorry about that, I think the documentation for dictionaries requires a bit more explanation. Thank you in advance!

inukshuk commented 4 months ago

@larawehbe the dictionary contains words for the tags name, publisher, journal and place.

The various names for 'pages' are part of the keyword feature. Looking at the pages keyword pattern the word 'pps' isn't currently matched -- is this commonly used? Then we should probably just add it there.

Regarding anystyle-data -- that's the default data used for the dictionary feature. It's a normal Ruby hash that's saved to a file on disk. You can just add additional words to it and overwrite the file if you like. However, the dictionary is only used to check for name, place, publisher and journal words.

larawehbe commented 4 months ago

I see now. So if I want to add special files to capture specific patterns in Arabic, can I do this using dictionaries? For example, a page reference in Arabic has the following pattern: ص 201, where ص refers to "page". If I create a txt file called pages.txt and add the patterns I want, does it work? The same goes for editor: each editor is preceded by المحرر (which means editor) and some other words. Could I use dictionaries in this case?

larawehbe commented 4 months ago

Also, I installed the anystyle-data repository in order to append my own dictionary, but I am not sure how I should use it inside the parser. And if I update dict.txt and then add it as dict.txt.gz, what should I do so that AnyStyle.parse takes it into consideration?

Your support is highly appreciated. Thank you for the quick reply!

larawehbe commented 4 months ago

@inukshuk For the keyword feature, which has when /^(pp?|pages?|S(eiten?)?|ff?)$/ :page, if I want to add the Arabic pattern, could I just change it like this:

     when /^(pp?|pages?|S(eiten?)?|ff?|ص?|صفحة?|)$/
            :page

Correct?

inukshuk commented 4 months ago

The dictionary feature is only used for the four tags I mentioned earlier. To add default dictionary data, you can add them to the lists in the data repository. Those lists are all compiled in the zip file which is later used to initialize new dictionaries. The lists are just text files with a special comment line #! <tag> indicating which tag to use for the words in the list.

Both editor and pages are covered by the keywords feature instead. And yes, if you modify the pattern as above then the matching words would be tagged as :page by the keyword feature. This might improve parse results. However, it should also be OK to just add some references containing the word in your training data. As long as the word for pages is not commonly used elsewhere in the reference you should be able to get good results even without modifying the feature.
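A small sketch of such a change, with one caveat about the pattern as written above: the optional ص? and the trailing empty alternative both allow the pattern to match an empty token, so the version below lists the Arabic words as plain alternatives instead. The helper name is hypothetical, and it is an assumption that the feature sees one token at a time with punctuation already removed.

```ruby
# Pattern as proposed above: 'ص?' and the trailing '|' both match ''.
proposed = /^(pp?|pages?|S(eiten?)?|ff?|ص?|صفحة?|)$/

# Suggested variant: the Arabic words are plain, required alternatives.
page_keyword = /^(pp?|pages?|S(eiten?)?|ff?|ص|صفحة)$/

# Hypothetical helper: :page for a page keyword, :none otherwise.
def keyword_tag(token, pattern)
  pattern.match?(token) ? :page : :none
end

%w[pp pages Seiten ص صفحة 201].each do |token|
  puts "#{token} -> #{keyword_tag(token, page_keyword)}"
end

# Compare how the two patterns treat an empty token.
puts "empty token, proposed: #{keyword_tag('', proposed)}"
puts "empty token, variant:  #{keyword_tag('', page_keyword)}"
```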

larawehbe commented 4 months ago

However, it should also be OK to just add some references containing the word in your training data

Thank you so much for this detailed explanation. Could you please elaborate here? Do you mean adding more samples to the training XML file, or something else?

Also, for dictionary:

  1. It means I can't add a new tag that's not covered by these 4, right?
  2. If I edit the dictionary file in anystyle-data (for instance, I have it at /Library/Ruby/Gems/gems/...../anystyle-data), can you please point me to the place where I can instantiate the dictionary in my Ruby code and add it to the parser? The documentation seems a bit high-level and doesn't cover these details.

Thanks in advance!

inukshuk commented 4 months ago

The first step to get better results should definitely be to train your own model. It's a good idea to use our default set as the basis, but then add your own references to it. The default model is trained using the core.xml. So to modify the model, I would use that file and add 20-30 of your own references, consistently tagged, to it and use it to train a new model.

Normally, this should already give you good results for your data set. However, since Arabic is not well supported by the feature extraction used by the parser, I'm pretty sure that there are a lot of ways to further improve the results by making some of the features aware of details specific to Arabic. In particular adding to the keyword feature sounds like a good idea.

Adding Arabic names, journal titles, places or publishers to the dictionary feature could also help, but I doubt that this will have a big impact. I'd consider doing this only if you notice that authors, place, publisher, or journal fields are tagged incorrectly even after training your model on Arabic references.

To train a new model you can use the CLI tool or the Rake scripts in this repository (or you can write your own script of course, but the scripts in the Rakefile are good examples).

larawehbe commented 4 months ago

@inukshuk Noted, thanks. I trained the model and it did give better results, but it sometimes failed on fields that have specific keywords, such as editor and page, as I mentioned.

So if I want to extend the keyword feature and the dictionary, do I edit the source code inside /Library/Ruby/Gems/..../anystyle/lib/anystyle/...? And for the dictionary, do I edit the one inside anystyle-data and it will just work, or should I add anything to the code that imports the anystyle library and uses it?

inukshuk commented 4 months ago

Yes, you can edit the locally installed keyword feature file and add Arabic terms to the patterns for pages and editors, for example. If these terms are commonly used in Arabic references, I'd also be happy to merge the changes here if you open a PR.

Similarly for the dictionary. If you have curated lists of Arabic names, publishers, journals, etc. you can add them to your local dictionary or you can also add them to the source files in anystyle-data and use the rake compile task there to compile a new default list for the dictionary. If you want to work with the Dictionary, I'd consult the source code of the class: it is only a thin wrapper around a Hash containing all the words. When using the Ruby adapter, this is the class that handles loading and saving of the Hash on disk.
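To give a feel for what "a Hash containing all the words" means in practice, here is a toy sketch. It is not the AnyStyle Dictionary class: the function names, the array-of-tags representation, the sample words and the use of Marshal for saving are all assumptions made for illustration.

```ruby
require 'tmpdir'

# Record each word (lower-cased) under the given tag.
def add_words(dict, tag, words)
  words.each do |word|
    key = word.downcase
    dict[key] = dict.fetch(key, []) | [tag]
  end
  dict
end

# Look up the tags recorded for a single word (empty array if unknown).
def tags_for(dict, word)
  dict.fetch(word.downcase, [])
end

# Made-up sample words; one word can carry several tags.
dict = {}
add_words(dict, :name, %w[Smith Garcia])
add_words(dict, :journal, %w[Research Smith])

# Save the plain Hash to disk and load it back (illustrative path).
path = File.join(Dir.tmpdir, 'toy-dictionary.bin')
File.binwrite(path, Marshal.dump(dict))
restored = Marshal.load(File.binread(path))

p tags_for(restored, 'SMITH')
p tags_for(restored, 'unknown')
```

The point is only that look-ups are per single word and that one word can belong to several of the four tags; for the real file format and API, the class source remains the reference.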