readsoftware / read

Research Environment for Ancient Documents
GNU General Public License v3.0
38 stars 4 forks source link

#201 fix terms and handling of terms with commas #10

Closed xadxura closed 3 years ago

xadxura commented 3 years ago

Updated terms to standardize mediums with lower case and other minor changes. Added new terms for missing terms and updated for Mayan terms. Fixed bug in Entity to handle comma'd terms.

stevewh commented 3 years ago

The term.sql conflicts with trm_id values in use for Mayan. enclosed is a term.sql (marked as term.txt) that should be ok with the exception of the location property labels (country, district, area, site, etc.) and numbering style in any database that uses these trm_ids. This can be done through checking trm_id fields in the datamodel for the old id values and updating them to the new values. We should review this to identify which databases require change. term.txt

for the regular expression we should test a string of labels like 'en=>"schist, green", it=>"scisto, verde"'

xadxura commented 3 years ago

The term.sql conflicts with trm_id values in use for Mayan. enclosed is a term.sql (marked as term.txt) that should be ok with the exception of the location property labels (country, district, area, site, etc.) and numbering style in any database that uses these trm_ids. This can be done through checking trm_id fields in the datamodel for the old id values and updating them to the new values. We should review this to identify which databases require change. term.txt

for the regular expression we should test a string of labels like 'en=>"schist, green", it=>"scisto, verde"'

Pushed and updated term list which retains the IDs for Mayan. The IDs are now off for Gandhari, but we have no expectation or need to take an update to Gandhari.orgs db from terms.sql, so the differences can be ignored. I tested 'en=>"schist, green", it=>"scisto, verde"' and found that the updated code works as intended for en - value returned is schist, green, not scisto, verde. Result is better than previous form for this string. I didn't try changing the language code to try and extract the Italian string.

xadxura commented 3 years ago

@stevewh, this should be ready to merge.

stevewh commented 3 years ago

Great work! We should create an issue for the testing of multiple language term extraction with low priority.

On Fri, Sep 10, 2021 at 10:45 AM Andrew Glass @.***> wrote:

Merged #10 https://github.com/readsoftware/read/pull/10 into maindev2.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/readsoftware/read/pull/10#event-5285766380, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARYOIPYDOTVAFJ7TZILOCTUBI73VANCNFSM5DLAY4IQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.