humlab-sead / sead_query_api

.NET Core port of Query SEAD API
Other
2 stars 0 forks source link

Bibliography filters display empty rows #104

Open Maanin00 opened 1 year ago

Maanin00 commented 1 year ago

Bibligraphy modern, Bibligraphy sites and Bibligraphy sites/samplegroups all contains rows with no text, nothing in the text column just values in the number column. If selected, the filter output shows a error message or displays the spinning search indicator indefinately. Should the rows contain text or are there faulty rows in db?

bild

roger-mahler commented 5 months ago

Problem is caused by authors and/or title being null. Most of the references are MAL reports.

There are in total 389 references in SEAD associated to MAL data. Only sample group or site references are considered, as given by this query:

with sample_groups as (
    select distinct sg.sample_group_id, sg.site_id
    from tbl_datasets d
    join tbl_dataset_masters dm using (master_set_id)
    join tbl_analysis_entities ae using (dataset_id)
    join tbl_physical_samples ps using (physical_sample_id)
    join tbl_sample_groups sg using (sample_group_id)
    where master_set_id = 2
),
mal_references as (
    select distinct biblio_id
    from sample_groups
    join tbl_sites s using (site_id)
    join tbl_site_references sr using (site_id)
    union
    select distinct biblio_id
    from sample_groups
    join tbl_sample_group_references sgr using (sample_group_id)    
)
    select *
    from tbl_biblio
    join mal_references using (biblio_id)

References in the following tables are ignored:

Ignored:

roger-mahler commented 5 months ago

The MAL report register (dated 20240116) contains 1096 entries. Only 28 of these reports exist in SEAD when matching match on full title:

select *
from tbl_biblio
join temp_mal_report_register using (title)

The MAL report register has been imported to sead_staging@staging_cluster.

roger-mahler commented 5 months ago

With some cleaning we get 126 matches:

with sample_groups as (
    select distinct sg.sample_group_id, sg.site_id
    from tbl_datasets d
    join tbl_dataset_masters dm using (master_set_id)
    join tbl_analysis_entities ae using (dataset_id)
    join tbl_physical_samples ps using (physical_sample_id)
    join tbl_sample_groups sg using (sample_group_id)
    where master_set_id = 2
),
sead_mal_references as (
    select distinct biblio_id
    from sample_groups
    join tbl_sites s using (site_id)
    join tbl_site_references sr using (site_id)
    union
    select distinct biblio_id
    from sample_groups
    join tbl_sample_group_references sgr using (sample_group_id)    
)
    select b.biblio_id, b.year, b.title, r.title, regexp_replace(lower(b.title), '\W+', '', 'g') as compressed
    from tbl_biblio b
    join sead_mal_references smr using (biblio_id)
    left join temp_mal_report_register r
      on regexp_replace(lower(r.title), '\W', '', 'g') = regexp_replace(lower(b.title), '\W', '', 'g')
    where r.title is not null
    order by b.year