gnames / bhlindex

BHLindex is used by Biodiversity Heritage Library to create their scientific names index

As a BHL developer I want a shorter version of a dump with filtered data #61

Closed: dimus closed this issue 1 year ago

dimus commented 1 year ago

According to @mlichtenberg, the previous version of bhlindex applies a set of filters to the dump. The columns needed in the new dump depend on whether its output is filtered the same way.

If the output WILL be filtered, then the needed columns are

names.csv

NameID
DetectedName
MatchedCanonical
MatchedFullName
RecordID
DataSourceID

occurrences.csv

NameID
PageID

If the output will NOT be filtered, then the needed columns are:

names.csv

NameID
DetectedName
MatchedCanonical
MatchedFullName
RecordID
DataSourceID
MatchSortOrder
MatchType
OddsLog10
Curation
Error

occurrences.csv

NameID
PageID
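
The two column sets above differ only by the five match-quality columns, so the choice can be driven by a single flag. The following is a minimal sketch of that idea, not bhlindex's actual dump code: the helper names `name_columns` and `write_csv` are hypothetical, and the input rows are assumed to be dictionaries keyed by the column names listed above.

```python
import csv
import io

# Column names copied from the lists above.
FILTERED_NAME_COLUMNS = [
    "NameID", "DetectedName", "MatchedCanonical",
    "MatchedFullName", "RecordID", "DataSourceID",
]
# Extra columns needed only when the consumer has to filter the data itself.
EXTRA_UNFILTERED_COLUMNS = [
    "MatchSortOrder", "MatchType", "OddsLog10", "Curation", "Error",
]
OCCURRENCE_COLUMNS = ["NameID", "PageID"]


def name_columns(filtered):
    """Return the names.csv columns for a filtered or unfiltered dump."""
    if filtered:
        return list(FILTERED_NAME_COLUMNS)
    return FILTERED_NAME_COLUMNS + EXTRA_UNFILTERED_COLUMNS


def write_csv(rows, columns, out):
    """Write rows (dicts) to out, keeping only the given columns, in order."""
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row[col] for col in columns})


# Usage with one made-up row (hypothetical values, for illustration only).
sample = dict.fromkeys(name_columns(False), "")
sample.update({"NameID": "1", "DetectedName": "Aus bus"})
buf = io.StringIO()
write_csv([sample], name_columns(True), buf)
```

occurrences.csv is the same in both cases, so it would use `OCCURRENCE_COLUMNS` regardless of the flag.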

dimus commented 1 year ago

Filter:

COPY (
  SELECT n.name, n.matched_name, n.matched_canonical
  FROM name_strings n
    INNER JOIN name_statuses st ON n.name = st.name
  WHERE (n.match_type IN ('ExactMatch', 'ExactCanonicalMatch') AND n.curation <> 'Unknown')
    OR (n.match_type IN ('FuzzyCanonical', 'FuzzyPartial') AND (st.odds > 1000000 OR n.edit_distance IN (0,1) OR n.stem_edit_distance IN (0,1)))
    OR (n.match_type IN ('NoMatch', '') AND st.odds > 1000000)
    OR (n.match_type = 'ExactPartialMatch')
) TO STDOUT DELIMITER '|'
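
For anyone who wants to apply the same rule outside the database, the WHERE clause above could be mirrored as a row predicate. This is only a sketch: the function name `keep_name` and the `ODDS_THRESHOLD` constant are hypothetical, and the `None` checks are my assumption about how to imitate SQL's treatment of NULL (a comparison with NULL does not select the row).

```python
ODDS_THRESHOLD = 1_000_000  # same cutoff as `st.odds > 1000000` in the query


def keep_name(match_type, curation, odds, edit_distance, stem_edit_distance):
    """Return True if a name would pass the WHERE clause of the query above.

    None stands in for SQL NULL: any comparison against it fails.
    """
    high_odds = odds is not None and odds > ODDS_THRESHOLD

    if match_type in ("ExactMatch", "ExactCanonicalMatch"):
        # Exact matches are kept only when the data source is curated.
        return curation is not None and curation != "Unknown"
    if match_type in ("FuzzyCanonical", "FuzzyPartial"):
        # Fuzzy matches need high odds or a very small edit distance.
        return (
            high_odds
            or edit_distance in (0, 1)
            or stem_edit_distance in (0, 1)
        )
    if match_type in ("NoMatch", ""):
        # Unmatched names are kept only when the odds are high.
        return high_odds
    # Exact partial matches are always kept; anything else is dropped.
    return match_type == "ExactPartialMatch"
```

The branches can be written as a chain of early returns because the `match_type` values in the four OR groups do not overlap. If only `OddsLog10` is available instead of `odds`, the equivalent cutoff would be 6, assuming that column holds the base-10 logarithm of the same odds value.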