humlab-sead / sead_bugs_import

SEAD bugs import
2 stars 0 forks source link

Bugs ecocode GeoJSON export returns wrong values (counts and sums). #19

Closed roger-mahler closed 4 years ago

roger-mahler commented 5 years ago

This issue is related to #17 and #18. These two queries should return the same data:

SELECT ps.physical_sample_id,
            sum(ab.abundance) AS abundance_sum,
            count(ab.abundance) AS abundance_count
FROM tbl_analysis_entities ae
JOIN tbl_datasets ds
  ON ds.dataset_id = ae.dataset_id
JOIN tbl_dataset_masters ms
  ON ms.master_set_id = ds.master_set_id     
JOIN tbl_physical_samples ps ON ps.physical_sample_id = ae.physical_sample_id
JOIN tbl_abundances ab ON ab.analysis_entity_id = ae.analysis_entity_id
JOIN tbl_taxa_tree_master tm ON tm.taxon_id = ab.taxon_id
WHERE ms.master_name = 'Bugs database'
GROUP BY ps.physical_sample_id
ORDER BY 1;

SELECT  ps.physical_sample_id, 
        sum(ab.abundance) AS abundance_sum,
        count(ab.abundance) AS abundance_count
FROM tbl_analysis_entities ae
JOIN tbl_datasets ds
  ON ds.dataset_id = ae.dataset_id
JOIN tbl_dataset_masters ms
  ON ms.master_set_id = ds.master_set_id
JOIN tbl_physical_samples ps
  ON ps.physical_sample_id = ae.physical_sample_id
JOIN tbl_abundances ab
  ON ab.analysis_entity_id = ae.analysis_entity_id
--JOIN tbl_taxa_tree_master tm
--  ON tm.taxon_id = ab.taxon_id
JOIN tbl_ecocodes e
  ON ab.taxon_id = e.taxon_id
JOIN tbl_ecocode_definitions ed
  ON e.ecocode_definition_id = ed.ecocode_definition_id
JOIN tbl_ecocode_groups eg
  ON eg.ecocode_group_id = ed.ecocode_group_id
WHERE TRUE
  AND eg.ecocode_system_id = 2
  AND ms.master_name = 'Bugs database'
GROUP BY ps.physical_sample_id
ORDER BY 1;
roger-mahler commented 5 years ago

Also note that GeoJSON export in production (materialized view physical_sample_ecocode_abundance lacks filter on master data set and ecocode system id. This will not work in databases that contain other non-Bugs data.

Loer0022 commented 5 years ago

I believe this is a case of misunderstanding the Bugs data, because the two queries are requesting different things and cannot provide the same output because of that.

What the first query is returning is taxa in relation to physical samples, meaning that only a value of 1 can be returned because there can only be one taxa (species) and that means for e.g. sample 2275 that there will be 125 different species. However, if we look at the second query it is returning taxa as well as the habitats they are representing meaning that for each taxa there can be a value of 1+X in abundance, where each species can represent one or more habitats. The abundance is therefore often greater than the total number of species present in the sample.

The two shouldn’t return the same data and they are exporting correct values.

A last note is the satement below, using that will not constrain the output and the ecocodes might combine Koch and Bugs ecology codes meaning that the database will provide erroneous outputs. You can see the error in the first set of images where an extra taxon_ID is added, 28984. Why this happens I am not sure.

WHERE ms.master_name = 'Bugs database'


The first of the images below is the origional output from the first query showing that there are 125 species in sample 2275, the second image is only showing the individual taxa instead of the total count/sum output.

Querry 1

Query 1 - true export


If we look at the second query output we can see that the Count and Sum are taking onto consideraiton the species habitat, and these values are correct when comparing with previously exported data. We can further see that each taxa has a abundance in general above 1 because of the habitats they are representing, and further dividing it to see each taxa in relation to ecocode_definitions they are representing (which the first query does not have) we can further control that the output is correct.

Query 2 - true export2

Query 2 - true export

Query 2

In the last image we can see that ecocodes_definitions_ID 10 is shown twice, because the row is not unique this will not cause any issue since it isn't counted twice.

visead commented 4 years ago

Bump @Loer0022 Then run GeoJSON export and send new file to DataARC

Loer0022 commented 4 years ago

These are the corrections and modifions made for the new Bugs GeoJSON.

...deprecated code...
Loer0022 commented 4 years ago

NOTE: in the querry code the keys should be,

fn_ecocode_crosstab _2_2 --- the code used here accidentally applied old keys which would simply have given an error output. The 2_2(1, 'count/sum'::text) code will export correct BugsCEP data.

Ecocode_system = 2 (Bugs ecocodes) Ecocode_group_id = 2 (Bugs) Master_set_id = 1 (Bugs database)

roger-mahler commented 4 years ago

Unfortunately the GeoJSON export turned out to be only half-way implemented.

Closing this issue in favor of https://github.com/humlab-sead/sead_change_control/issues/64.