Open JenPreciado opened 7 years ago
Dear @JenPreciado, apologies for this very late answer; it has indeed been a long time since you opened this issue.
In general, the amount of training data available for grobid-quantities is, as of today, still quite limited, and covers mostly papers from astronomy, health and physics. That said, we are quite happy with the current performance, although we are aware that more data is needed.
We haven't really focused on recognising vague expressions because of the follow-up problem of how to normalise them (e.g. expressions like "in recent time"). We have decided to leave the vague part of the expression out of the annotation, e.g. later <date>July</date>, and focus on resolving the expression itself.
BTW, the annotation guidelines (https://grobid-quantities.readthedocs.io/en/latest/guidelines.html) have been improved over the last year with more examples and special cases. In brief, we annotate time/date expressions as <date when="2001-08">2001 August</date> (https://grobid-quantities.readthedocs.io/en/latest/guidelines.html#additional-items).
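To make the annotation format concrete, here is a minimal Python sketch that builds such a <date> element from a "year month" string. The function name `annotate_year_month` is hypothetical and is not part of grobid-quantities; it only handles the English "YYYY Month" pattern and is meant as an illustration of the target output, not of how the tool works internally.

```python
import re
from xml.sax.saxutils import escape

# English month names, in calendar order (index + 1 == month number).
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def annotate_year_month(text: str) -> str:
    """Wrap a 'YYYY Month' expression in a <date> element with an ISO 'when' value.

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"\s*(\d{4})\s+([A-Za-z]+)\s*", text)
    if match is None or match.group(2).lower() not in MONTHS:
        raise ValueError(f"not a 'YYYY Month' expression: {text!r}")
    year = match.group(1)
    month = MONTHS.index(match.group(2).lower()) + 1
    # The surface text is kept as written; only the attribute is normalised.
    return f'<date when="{year}-{month:02d}">{escape(text.strip())}</date>'


print(annotate_year_month("2001 August"))
```

The surface form stays untouched inside the element, and only the `when` attribute carries the normalised value.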
If you are still working on it and you want to share what you have done, we can see whether there are some complementary needs.
I recently annotated a cryology-related blog and noticed that grobid doesn't allow vague or inexplicit units of time to be captured. Examples of these include: late July, early August, end of the month, this week, through April, recent decades, etc.
I also noticed that it ignores mentions of seasons like spring, fall, summer, summertime, winter, wintertime. Cryology has its own unique terms to denote seasons, like melt season or ice growth period.
It would be very useful if grobid 1) could capture these vague time expressions, 2) could link them to the publishing date of the document/blog/article, and 3) allowed prototypical seasons (if not also the season terms unique to cryology) to be captured as a kind of time expression.
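As a rough illustration of points 1) and 2), the Python sketch below resolves expressions such as "late July" into a date range anchored on the publication date. Everything in it is an assumption made for the example and not grobid behaviour: the function name `resolve_vague_month`, the day boundaries chosen for early/mid/late, and the simplification that the expression refers to the publication year.

```python
import calendar
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june",
         "july", "august", "september", "october", "november", "december"],
        start=1,
    )
}


def resolve_vague_month(expression: str, published: date):
    """Map 'early/mid/late <Month>' to a (start, end) date range.

    Hypothetical helper: the year is assumed to be the publication year,
    and the day boundaries are arbitrary choices for this sketch.
    Returns None for expressions it does not understand.
    """
    match = re.fullmatch(r"(early|mid|late)\s+([a-z]+)", expression.strip().lower())
    if match is None or match.group(2) not in MONTHS:
        return None
    modifier, month_name = match.groups()
    month = MONTHS[month_name]
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(published.year, month)[1]
    bounds = {"early": (1, 10), "mid": (11, 20), "late": (21, last_day)}
    start_day, end_day = bounds[modifier]
    return (
        date(published.year, month, start_day),
        date(published.year, month, end_day),
    )


published = date(2017, 9, 1)  # example publication date, made up for the demo
print(resolve_vague_month("late July", published))
print(resolve_vague_month("this week", published))  # unsupported here, gives None
```

Expressions like "this week", "recent decades" or "melt season" would need their own rules (and, for the seasons, hemisphere or domain knowledge), which is why the sketch simply returns None for them.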