gigascience / gigadb-website

Source code for running GigaDB
http://gigadb.org
GNU General Public License v3.0
9 stars 15 forks source link

The Geolocation (altitude and longitude) of dataset in the sample attribute table is not standardized #2022

Closed kencho51 closed 2 months ago

kencho51 commented 2 months ago

Relate to #1944

When I executed the query in the actionMapbrowse in my staging database,

SELECT d.identifier,  d.title, satt.value, sp.scientific_name as sciname, s.id as sampleid FROM dataset as d
INNER JOIN dataset_sample as dsam on dsam.dataset_id = d.id
INNER JOIN sample as s on s.id = dsam.sample_id
INNER JOIN sample_attribute as satt on satt.sample_id=s.id
INNER JOIN species as sp on sp.id = s.species_id
where satt.attribute_id = 269 and d.upload_status='Published' order by sampleid;

the returned value are something like, not recorded, 00°15' N 98°08' E, off Heligoland, North Sea or 31.48.

I then further looked into the sample_attribute table using:

SELECT attribute_id, value, count(*) FROM sample_attribute WHERE attribute_id = 269 GROUP BY attribute_id, value order by count desc;

And then got this returns:

attribute_id value count
269 not provided    3473
269 restricted access   1096
269 not recorded    257
269 not collected   131
269 32°35'N 103°37'E  74
269 30°0'N 103°02'E   38
269 31°04′37″N, 121°10′35″E   29
269 Multiple    27
269 Not Recorded    22
269 29°02'N 100°17'E  19
269 30°34'N 104°03'E  19
269 35°24'N 116°34'E  19
269 29°24'N 105°36'E  18
269 19°07′16″N, 109°05′00″E   11
269 42°20' N, 126°50' E   11
269 not applicable  10
269 31.48   9
269 N31°, E117°   9
269 35.9940° N, 78.8986° W    8
269 Aldabra, Indian Ocean   5
269 Sri Lanka, Indian Ocean 4
269 off Heligoland, North Sea   4
269 Morro Bay, San Luis Obispo, California, USA, Pacific Ocean  3
269 02°50'56'' N 95°56'24'' E 3
269 19°00′07″N, 109°32′33″E   3
269 Costa Brava, Spain, Mediterranean Sea   3
269 Kobe, Japan, Pacific Ocean  3
269 62°14'S, 58°47'W  3
269 29°24′00″N, 110°09′56″E   3
269 nor recorded    3
269 24°04′34″N, 110°08′17″E   3
269 Loblolly Bay, Anegada, British Virgin Islands, Caribbean Sea    2
269 00°15' N 98°07' E 2
269 00°02' S 73°24' E 2
269 23°20′22″N, 108°22′49″E   2
269 Weddell Sea, Southern Ocean 2
269 Caribbean Sea   2
269 Iceland Basin, Atlantic Ocean   2
269 37°26'N, 126°22'E 2
269 Atapupu, Timor, Indonesia, Indian Ocean 2
269 Singapore, Pacific Ocean    2
269 30.50′–30.180′N, 114.210′–114.390′E 2
269 Arendal, Norway, Skagerrak, North Sea   2
269 33°35'20'' N 135°10'50'' E    2
269 East Asia, Pacific Ocean    2
269 21°54′3″N, 111°23′58″E    2
269 13894; 19°33’3�N, 110°47’25� E    2
269 S off New Scotia, Atlantic Ocean    2
269 New Caledonia, Pacific Ocean    2
269 Bassin de la Gazelle, Kerguelen Islands, Southern Ocean 2
269 00°15' N 98°08' E 2
269 Gauss Base, Antarctica, Southern Ocean  2
269 18°59′58″N, 109°05′03″E   2
269 North Sea   2
269 off Florida, USA, Gulf of Mexico    2
269 not available   2
269 Penguin Bank, Oahu Island, Hawaii, USA, Pacific Ocean   2
269 35.202773   1
269 off Elephant Island, Southern Ocean 1
269 Monrovia, Liberia   1
269 24°58′12″N, 112°53′17″E   1
269 35.114876   1
269 23°36′21″N, 113°57′18″E   1
269 Abeloya, Kong Karls Land    1
269 Not recorded    1
269 30°36′28″N, 119°52′22″E   1
269 Romblon, Philippines, Pacific Ocean 1
269 Hastings, Barbados, Caribbean Sea   1
269 Dry Tortugas, Florida, USA, Gulf of Mexico  1
269 29°37′51″N, 109°50′17″E   1
269 Rovinj, Croatia, Mediterranean Sea  1
269 26°47′29″N, 117°02′21″E   1
269 Polana Beach, Mozambique, Indian Ocean  1
269 21°50′13″N, 108°03′27″E   1
269 Madeira, Portugal, Atlantic Ocean   1
269 Misaki, Japan, Pacific Ocean    1
269 Parry Island, Eniwetok Atoll, Marshall Islands, Pacific Ocean   1
269 18°54′40″N, 109°41′12″E   1
269 25°40'00'' S 86°46'05'' W 1
269 23°10′47″N, 113°21′33″E   1
269 18°39′43″N, 110°16′19″E   1
269 Pemba Channel, Tanzania, Indian Ocean   1
269 33°56' N 26°43' E 1
269 Port Jackson, Sydney, Australia, Pacific Ocean  1
269 Mombasa, Kenya, Indian Ocean    1
269 30°11′36″N, 118°09′20″E   1
269 35°10' S 23°02' E 1
269 Low Island, Low Isles, Australia, Pacific Ocean 1
269 01°49' N 45°29' E 1
269 Ecuador, Pacific Ocean  1
269 Ralum, East New Britain, Papua New Guinea, Pacific Ocean,   1
269 60° 9′ 0″ N, 24° 39′ 30″ E    1
269 Maroon Point, Mauritius, Indian Ocean   1
269 40°0´38.02´'N3°52´41.4´´E  1
269 Paita, Peru, Pacific Ocean  1
269 St. Thomas, Caribbean Sea   1
269 22°26′46″N, 106°53′17″E   1
269 61°42' N 09°36' W 1
269 35.161870   1
269 29°46′34″N, 110°05′50″E   1
269 Golf of Suez, Red Sea, Indian Ocean 1
269 31°58'82'' S 174°15'87'' E    1
269 29°33′33″N, 103°18′45″E   1
269 Azores, Portugal, Atlantic Ocean    1
269 43°56' S 60°52' W 1
269 Papua New Guinea, Pacific Ocean 1
269 45.5° N, 122.7° W 1
269     1
269 06°19' S 12°02' E 1
269 18°45′15″N, 110°12′43″E   1
269 35.192253   1
269 Angra Pequena, Lüderitz, Namibia, Atlantic Ocean 1
269 34.0689° N, 118.4452° W   1
269 Jakarta, Indonesia, Pacific Ocean   1
269 30°02'09'' N 28°24'02'' W 1
269 S Chile, Pacific Ocean  1

So clearly, besides the CSP problem, the value itself in the sample_attribute table is not standardized and not formatted well, it causes the geojson_features array is empty, which leads to no dots can be seen in the map in the /site/mapbrowse page.

The sanity check in the mapbrowse.php is not enough to check all the value cases in the sample_attribute table in the current production database.

To do:

rija commented 2 months ago

To ensure the relevant information stay together, here are the gitter chat converstion adding precision to this ticket:

The ticket of concern is #2022, not #1944. The #1944's purpose is now only about fixing the CSP issue which I believed Ken is finalising a PR for.

When you deploy GigaDB to AWS, the RDS PostgreSQL there get loaded with a backup of live data You can make the following SQL query (from #2022 which also show the result) using pgAdmin/DBeaver/PHPStorm to get data that feeds into mapbrowse.php:

SELECT attribute_id, value, count(*) FROM sample_attribute WHERE attribute_id = 269 GROUP BY attribute_id, value order by count desc;

269 is the ID for geographic location (latitude and longitude) in the attribute table. Clearly most, if not all of the results are invalid coordinates.

There is another query you can make:

SELECT attribute_id, value, count(*) FROM sample_attribute WHERE attribute_id = 270 GROUP BY attribute_id, value order by count desc;

Where 270 is the ID for geographic location (country and/or sea,region)

So there are cases in the results where the author seemed to have used the wrong attribute and put info in 269 instead 270. There are also cases of incorrect annotation to describe geolocation (that includes with weird characters too) Finally there are the cases where the author cannot input the geolocation data and instead fills in the reason.

The curation guidelines in https://gigadb.org/site/guide (look for the aforementioned attributes labels in the "Required metadata" section) Is very clear about what the annotation should be

Resolving #2022 seems to be about:

  1. data correction
  1. Code change
rija commented 2 months ago

Hi @alli83,

After talking to @only1chunts, I can confirm the button on the page that calls the sample map should be seen by everyone, not just logged-in user.

only1chunts commented 2 months ago

Hi, Not been following this as closely as I should have been, but please do NOT use attribute 269, that is an old obsolete attributes that are no longer being actively used and eventually the curation team will convert those to the individual attributes for latitude (391) and longitude (392). The individual lat and lon terms are the ones that should be mapped as per the request in ticket #831.

rija commented 2 months ago

Hi @alli83,

It seems that we were looking at the wrong attributes IDs. Making the work described in this ticket redundant.

The work to be done to get mapbrowse working again is to update the script to read latitude and longitude from different attribute values. That work is already described in #831 which is the one to work on, so I'm closing this one