srvarey / gbif-occurrencestore

Automatically exported from code.google.com/p/gbif-occurrencestore
0 stars 0 forks source link

Occurrence join to taxon concept not working correctly #27

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Occurrence ID: 311632412

select 
data_resource_id,kingdom,phylum,class,order_rank,family,genus,scientific_name,au
thor from tim_rollover2_portal_raw_occurrence_record where id= 311632412
=> 
1440    NULL    NULL    NULL    NULL    NULL    NULL    Valeriana edulis    NULL

However, this ID was retrieved using:
select occurrence_id from tim_rollover_temp_taxon_occurrence t1 join 
tim_rollover2_portal_taxon_concept t2 on t1.taxon_concept_id=t2.id where 
t2.data_resource_id=1472 and t2.taxon_name_id is null

Note that the TC is for data_resource_id 1472 but the occurrence record is 
actually for a record from data_resource_id 1440

Investigate where this is going wrong

Original issue reported on code.google.com by timrobertson100 on 27 Apr 2011 at 6:08

GoogleCodeExporter commented 9 years ago
Taxon_concept_id in question: 53354205

Original comment by timrobertson100 on 27 Apr 2011 at 6:11

GoogleCodeExporter commented 9 years ago
select * from tim_rollover2_temp_taxon_distinct_taxonomy where 
taxon_concept_id= 53354205
=> 53354205 S

How can this be S when it is an INT ID?

Original comment by timrobertson100 on 27 Apr 2011 at 6:16

GoogleCodeExporter commented 9 years ago
select * from tim_rollover2_temp_normalized3 where id= 53354205
53354205    53354202    1472    NULL    0   ["S"]

Indeed it is "S"

Original comment by timrobertson100 on 27 Apr 2011 at 6:21

GoogleCodeExporter commented 9 years ago
select * from tim_rollover2_temp_normalized2 where id= 53354205
0   53354205    1472    123 120 Spatoglossum asperum    J. Agardh   0   S

Still "S" and still 1472

Original comment by timrobertson100 on 27 Apr 2011 at 6:26

GoogleCodeExporter commented 9 years ago
Strange result in the temp_distinct_taxonomy itself:

select * from tim_rollover2_temp_distinct_taxonomy where data_resource_id=1472 
and scientific_name='Spatoglossum asperum'

2706679 1472    NULL    NULL    NULL    NULL    Dictyotaceae    Spatoglossum    Spatoglossum 
asperum J. 
Agardh  ["57325397","57325399","57325422","57325425","57325439","57325447","57325
465","57325471","57325497","57325499","57325501","57325503"]

2706680 1472    NULL    NULL    NULL    NULL    Dictyotaceae    Spatoglossum    Spatoglossum 
asperum J. Agardh   `   ["57325451"]

The columns don't align and there is a spurious ` either in the author or the 
occurrence field.

Original comment by timrobertson100 on 27 Apr 2011 at 6:31

GoogleCodeExporter commented 9 years ago
Hive is producing strange results.

Getting the CSV file of the following query:
select author from tim_rollover2_temp_distinct_taxonomy where 
data_resource_id=1472 and scientific_name='Spatoglossum asperum'

provides:

"_col0"
"J. Agardh"
"J. Agardh","`"

This seems to be throwing everything off subsequently, and likely to be the 
cause of the real issue logged here.

Original comment by timrobertson100 on 27 Apr 2011 at 6:35

GoogleCodeExporter commented 9 years ago
Looking at the LIVE portal, we see:

mysql> select data_resource_id,author from raw_occurrence_record where 
id=57325451; 
+------------------+-------------+
| data_resource_id | author      |
+------------------+-------------+
|             1472 | J. Agardh  ` | 
+------------------+-------------+

The good news is that the DR id looks aligned, so it seems this field is 
throwing everything off, and misaligning joins.  

Need to investigate what the character is that is causing this

Original comment by timrobertson100 on 27 Apr 2011 at 6:41

GoogleCodeExporter commented 9 years ago
It is a \t

Original comment by timrobertson100 on 27 Apr 2011 at 6:46

GoogleCodeExporter commented 9 years ago

Original comment by timrobertson100 on 27 Apr 2011 at 6:48

GoogleCodeExporter commented 9 years ago

Original comment by oliver.m...@gmail.com on 28 Apr 2011 at 7:24

GoogleCodeExporter commented 9 years ago
This appears to be solved due to \t (and other bad char) cleanup in the source.

Original comment by oliver.m...@gmail.com on 25 May 2011 at 8:02