grambank / pygrambank

sourcelookup not evaluating a source #51

Closed HedvigS closed 2 years ago

HedvigS commented 2 years ago

For feature GB126 in sheet MM_kore1280 in grambank original sheet, there is a source listed as "Sohn (2015:324) Grammaticalization". I'm guessing this is because there is more than one work by Sohn from this year, and that it should be "Sohn_Grammaticalization (2015:324)". However, pygrambank sourcelookup doesn't seem to evaluate this source. When I run it on this sheet, I get:

Resolved sources:
155 g_Sohn_Korean   Sohn, Ho-min. 1994. Korean. (Descriptive Grammars Series.) London: Routledge. xvii+584pp.
120 g_Sohn_Korean_1999  Sohn, Ho-Min. 1999. The Korean language. Cambrige: Cambridge University Press. 462pp.
20  g_LeeRamsey_Korean  Iksop Lee and Ramsey, Robert S. 2000. The Korean Language. (SUNY Series in Korean Studies.) New York: State University of New York Press. xiii+374pp.

OK

Shouldn't it also check "Sohn (2015:324) Grammaticalization" and throw some kind of warning?

xrotwang commented 2 years ago

AFAICT Glottolog doesn't have anything by Sohn from 2015 in hh.bib. The problem here is that sourcelookup is happy as long as at least one reference can be matched per datapoint. So the unmatched Sohn (2015:324) Grammaticalization is simply read as some sort of comment.
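The behaviour described above can be illustrated with a minimal sketch. This is not pygrambank's actual code: the regular expression, the `KNOWN` lookup table and the function names are all hypothetical, and the example source string is made up for illustration. It only shows how a "one match is enough" check passes while a stricter check would flag the 2015 reference.

```python
import re

# Hypothetical pattern for "Author (Year:Pages)"; not pygrambank's tokenizer.
REF_PATTERN = re.compile(
    r"(?P<author>[A-Za-z][\w-]*)\s*\((?P<year>\d{4})(?::(?P<pages>[^)]*))?\)"
)

# Hypothetical stand-in for a bibliography lookup (assumed, not hh.bib itself).
KNOWN = {
    ("Sohn", "1994"): "g_Sohn_Korean",
    ("Sohn", "1999"): "g_Sohn_Korean_1999",
}


def resolve(source_field, known=KNOWN):
    """Split the references found in a source field into matched keys and unmatched refs."""
    matched, unmatched = [], []
    for m in REF_PATTERN.finditer(source_field):
        key = (m.group("author"), m.group("year"))
        if key in known:
            matched.append(known[key])
        else:
            unmatched.append((m.group("author"), m.group("year"), m.group("pages")))
    return matched, unmatched


def lenient_ok(source_field):
    """Pass as soon as at least one reference is matched."""
    matched, _ = resolve(source_field)
    return bool(matched)


def strict_ok(source_field):
    """Pass only if at least one reference is matched and none is left unmatched."""
    matched, unmatched = resolve(source_field)
    return bool(matched) and not unmatched


example = "Sohn (1994:12); Sohn (2015:324) Grammaticalization"
print(resolve(example))
print(lenient_ok(example), strict_ok(example))
```

With the lenient check, the datapoint passes because the 1994 reference resolves; the unmatched 2015 reference only surfaces if unmatched references are collected and reported separately.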

HedvigS commented 2 years ago

Yes, I would want it to say that it tried matching Sohn 2015 to hh.bib or gb.bib and that it couldn't.

Right, okay. What can we do to change that? I know that sourcelookup already ignores things like "personal correspondence" etc., but in this case we'd like it to try to resolve it (and fail).

xrotwang commented 2 years ago

So, something like this seems fairly easy to implement:

$ grambank sourcelookup original_sheets/MM_kore1280.tsv ~/projects/glottolog/glottolog
WARNING:pygrambank.srctok:unmatched ref: ('Robbeets', '2017', '611', None)
WARNING:pygrambank.srctok:unmatched ref: ('Sohn', '2015', '324', None)
WARNING:pygrambank.srctok:unmatched ref: ('Sohn', '2015', '324', None)
WARNING:pygrambank.srctok:unmatched ref: ('Sohn', '2015', '324', None)
WARNING:pygrambank.srctok:unmatched ref: ('Sohn', '2015', '324', None)
WARNING:pygrambank.srctok:unmatched ref: ('Sohn', '2015', '325', None)
Resolved sources:
155 g_Sohn_Korean   Sohn, Ho-min. 1994. Korean. (Descriptive Grammars Series.) London: Routledge. xvii+584pp.
120 g_Sohn_Korean_1999  Sohn, Ho-Min. 1999. The Korean language. Cambrige: Cambridge University Press. 462pp.
20  g_LeeRamsey_Korean  Iksop Lee and Ramsey, Robert S. 2000. The Korean Language. (SUNY Series in Korean Studies.) New York: State University of New York Press. xiii+374pp.

OK

HedvigS commented 2 years ago

That'd be great!

And maybe also some kind of warning for strings that include a space and something after the YEAR:PAGES, like the " Grammaticalization" in this case.
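A check like this could be sketched with a regular expression. The pattern and the function name below are hypothetical and not part of pygrambank; the sketch assumes that references within a source field are separated by semicolons.

```python
import re

# Hypothetical pattern: a closing "(YEAR)" or "(YEAR:PAGES)" followed by
# whitespace and further text before the next semicolon.
TRAILING = re.compile(r"\(\d{4}(?::[^)]*)?\)\s+(?P<extra>[^;\s][^;]*)")


def trailing_text(source_field):
    """Return any text that follows a (YEAR:PAGES) group within the same reference."""
    return [m.group("extra").strip() for m in TRAILING.finditer(source_field)]


print(trailing_text("Sohn (2015:324) Grammaticalization"))
print(trailing_text("Sohn (1994:12); Sohn (1999)"))
```

A non-empty result would be the trigger for a warning; whether the trailing text is a title fragment or a legitimate comment still needs a human to decide.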

xrotwang commented 2 years ago

fix pushed to master

HedvigS commented 2 years ago

Thank you

HedvigS commented 2 years ago

It used to be that the pygrambank cldf command ran sourcelookup on every sheet, right, so that the output from there could be used for our "check warnings" to-do list? Is this still the same with cldfbench?

xrotwang commented 2 years ago

yes

HedvigS commented 2 years ago

Okay, thanks. I'll try and re-install and run it so that I can update the to do list for automatic warnings.

HedvigS commented 2 years ago

cldfbench is complaining that the proto-languages, like ocea1241, don't have a macroarea. Does this muck anything up? Would you like me to submit a PR to glottolog/glottolog adding macroareas for family-level languoids?

xrotwang commented 2 years ago

Just ignore this warning.

HedvigS commented 2 years ago

Ok.

HedvigS commented 2 years ago

I just noticed a shortcoming with the new implementation of sourcelookup. When I run cldfbench I get warnings like the ones above, but unlike the other feedback they don't tell me in which sheets the warning occurs. That makes it hard to do a full evaluation of all sheets to update the to-do lists.

WARNING:pygrambank.srctok:unmatched ref: ('Mangulu', '2002', None, None)
WARNING:pygrambank.srctok:unmatched ref: ('Mangulu', '2002', None, None)
WARNING:pygrambank.srctok:unmatched ref: ('Mangulu', '2002', None, None)
WARNING:pygrambank.srctok:unmatched ref: ('Mangulu', '2002', None, None)
WARNING:pygrambank.srctok:unmatched ref: ('Mangulu', '2002', None, None)

HedvigS commented 2 years ago

I can read in all the sheets line by line and then match them against this output, but it'd be easier if it could report the sheet right away, please.
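The workaround mentioned above could look roughly like the following sketch. It is not part of pygrambank; the function name is hypothetical, and it assumes the sheets are tab-separated files with a column named `Source` (passed as a parameter so it can be changed). It uses a plain substring test, so it may over-report.

```python
import csv
from pathlib import Path


def sheets_mentioning(sheet_dir, author, year, source_column="Source"):
    """List the .tsv sheets in sheet_dir whose source column mentions author and year."""
    hits = []
    for path in sorted(Path(sheet_dir).glob("*.tsv")):
        with path.open(encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                source = row.get(source_column) or ""
                # Crude substring test; good enough to narrow down candidate sheets.
                if author in source and year in source:
                    hits.append(path.name)
                    break  # one hit per sheet is enough
    return hits


if __name__ == "__main__":
    # Example call, using the directory name from the command earlier in this thread.
    print(sheets_mentioning("original_sheets", "Mangulu", "2002"))
```

This only narrows the warnings down to candidate sheets after the fact; having the warning itself name the sheet would still be the more direct solution.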