srvarey / gbif-occurrencestore

Automatically exported from code.google.com/p/gbif-occurrencestore

Missing records? #9

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
1,000,000 records in raw_occurrence_record_small result in 999,995 records after 
processing.

This might be valid, but inspect missing records to check.

Original issue reported on code.google.com by timrobertson100 on 18 Apr 2011 at 6:38

GoogleCodeExporter commented 9 years ago

Original comment by timrobertson100 on 4 May 2011 at 1:34

GoogleCodeExporter commented 9 years ago
After all the changes, this issue now stands as follows:

In Hive:
ROR_small: 1,000,000
OR: 996,908

This may still be valid, but the dropped records need inspection.

Original comment by timrobertson100 on 5 May 2011 at 7:50

GoogleCodeExporter commented 9 years ago
These are all hybrids.

E.g. scientific name 
Bothus rhombus x maximus

Hybrids have not yet been handled in name parsing.  

However, these records *should* not be dropped but identified to a higher taxon.

E.g. Occurrence record 53801 has:
Animalia
  Chordata
    Osteichthyes
      Pleuronectiformes
        Bothidae
          Bothus
            Bothus rhombus x maximus

It should be identified to Bothus even if the final name is not found.  
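
A minimal sketch of that fallback, assuming the classification is available as a simple rank-to-name mapping and the nub can be queried as a set of known names. The function name `match_to_nub`, the `RANKS` list and the in-memory `nub_names` set are hypothetical and only illustrate the idea; they are not the project's actual lookup code.

```python
# Hypothetical sketch: walk the classification from the most specific name
# upwards and return the first name that is known to the nub.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "scientific_name"]


def match_to_nub(classification, nub_names):
    """Return (rank, name) of the most specific name found in nub_names, else None."""
    for rank in reversed(RANKS):
        name = classification.get(rank)
        if name and name in nub_names:
            return rank, name
    return None


# Classification of occurrence record 53801, as quoted above.
record_53801 = {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Osteichthyes",
    "order": "Pleuronectiformes",
    "family": "Bothidae",
    "genus": "Bothus",
    "scientific_name": "Bothus rhombus x maximus",
}

# Assumed nub content for illustration: everything except the hybrid name.
nub_names = {
    "Animalia", "Chordata", "Osteichthyes",
    "Pleuronectiformes", "Bothidae", "Bothus",
}

print(match_to_nub(record_53801, nub_names))  # ('genus', 'Bothus')
```

With a fallback like this the hybrid record would still get an identification (to the genus), so it would survive any later step that requires a nub match.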

The records are not getting an identification, and since the 
occurrence_record.q has:

 "JOIN ${occurrence_nub} nub ON r.id = nub.occurrence_id"

anything that is not somehow identified to the NUB will be dropped.
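
The effect of the inner join can be reproduced on a toy data set. The sketch below uses Python's built-in `sqlite3` rather than Hive, and the table names `raw_record` and `nub`, the second sample row and the nub id are made up for illustration; only the join condition mirrors the one quoted above.

```python
import sqlite3

# Toy stand-ins for the raw records and the nub identifications (assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_record (id INTEGER PRIMARY KEY, scientific_name TEXT);
    CREATE TABLE nub (occurrence_id INTEGER, nub_id INTEGER);
    INSERT INTO raw_record VALUES (1, 'Bothus podas');
    INSERT INTO raw_record VALUES (53801, 'Bothus rhombus x maximus');
    -- Only record 1 received an identification; the hybrid did not.
    INSERT INTO nub VALUES (1, 1001);
""")

# Inner join: records without a nub row silently disappear.
kept = conn.execute("""
    SELECT r.id FROM raw_record r
    JOIN nub ON r.id = nub.occurrence_id
    ORDER BY r.id
""").fetchall()

# Left outer join filtered on NULL: lists exactly the dropped records.
dropped = conn.execute("""
    SELECT r.id, r.scientific_name FROM raw_record r
    LEFT OUTER JOIN nub ON r.id = nub.occurrence_id
    WHERE nub.occurrence_id IS NULL
    ORDER BY r.id
""").fetchall()

print(kept)     # [(1,)]
print(dropped)  # [(53801, 'Bothus rhombus x maximus')]
```

The second query is also a convenient way to inspect which records go missing between the raw and the processed table.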

Propose that before handling hybrids (which requires thought and advice from Andrea) 
we implement the higher name matching, which is necessary anyway for names not 
in the nub.  Suspect this change needs to be done in the nub lookup UDF.

Original comment by timrobertson100 on 5 May 2011 at 8:02