plazi / BLR-website


Dashboards Specs #5

Open mguidoti opened 5 years ago

mguidoti commented 5 years ago

The look and the type of the dashboards (charts, bars, etc.) will be decided later.

You can design the look of the charts later; however, you will have to provide me with the content of the charts now, as I am making the API that will let you retrieve the stats. Here are my suggestions in this regard:

Think of each chart as a packet of stats with its x and y values (assuming it is an x-y type of chart). I will send each of those packets in the API result. All the charts, in any view, will be relevant to that particular result set. So, think of the URI that selected the "treatments" (for example), returning a set of treatments – the charts shown will reflect that returned result set.

On the very first instance, there will not be any user input. So the charts will represent the entire result set, that is, the entire set of treatments. Since the various bundles of stats will be packaged in the result set, think of bundles that could be universally useful. That way the API will be useful to everyone beyond just this website. If you want to make something very customized for your own use, then it is better to make several API calls to get the info you want and then calculate your own specific results on your end.

Always think in terms of resources – treatments, publications, images, etc. The API retrieves resources, and each API call retrieves a specific state of a specific resource. If you want to show more than one resource on a page, you will have to make more than one API call, one for each resource. If you want to show more than one specific state of a resource, you will have to make more than one API call, one for each state. And so on.

If you make multiple API calls, you can wrap them in a Promise to get your composite results.
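A minimal sketch of that idea, assuming each resource is fetched with its own call and the results are combined with `Promise.all`. The helper names `fetchComposite` and `defaultFetchJson` are hypothetical, and any endpoint path other than `/v2/treatments` is an assumption about how the other resources would be exposed.

```javascript
// Fetch one URL and parse the body as JSON (uses the global fetch of browsers and Node 18+).
async function defaultFetchJson(url) {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`Request failed: ${response.status} ${url}`);
    return response.json();
}

// Fire all requests in parallel and return an object keyed by the names in `urls`.
// `fetchJson` is injectable so the function can be exercised without a network.
async function fetchComposite(urls, fetchJson = defaultFetchJson) {
    const names = Object.keys(urls);
    const results = await Promise.all(names.map((name) => fetchJson(urls[name])));
    return Object.fromEntries(names.map((name, i) => [name, results[i]]));
}

// Usage (not executed here; the second URL is an assumed path):
// const page = await fetchComposite({
//     treatments: 'https://zenodeo.punkish.org/v2/treatments?q=maratus',
//     figures: 'https://zenodeo.punkish.org/v2/figureCitations?q=maratus',
// });
```

`Promise.all` rejects as soon as any one call fails; if partial results are acceptable, `Promise.allSettled` would be the alternative.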

Hope this helps.

Originally posted by @punkish in https://github.com/plazi/BLR-website/issues/4#issuecomment-536931978

teodorgeorgiev commented 5 years ago

Hi @punkish, here are our first comments with regards to the "v2/treatments" endpoint.

It looks very promising!

I saw you have pagesize and pagenum, but I didn't see any "sortBy" options.

For all facet groups we will need a list of values with the respective count, i.e.:

    facets: {
        journalTitles: {
            displayName: "Journal Titles",
            total: 524,
            data: [
                {
                    displayName: "Zookeys",
                    total: 10
                },
                {
                    displayName: "Zootaxa",
                    total: 16
                },
                ---
            ]
        },
        ---
    }

In the current design we have "Article Author". @mguidoti could you please check if that is correct? I guess it should be "authorityName"?

Here are the missing facet groups for the treatments:

            relatedMaterialCitations: {},
            relatedTreatmentCitations: {}, 
            hasFigures: {}, 
            collectionCodes: {}, 

Here are the missing fields for the treatment record:

records: [
            {

                figuresCnt: 10,
                materialsCnt: 10,
                externalLinks: {                    
                    plazi: {href:"", name:"Plazi"},
                    zenodo: {href:"", name:"Zenodo"},
                    gbif: {href:"", name:"GBIF"},                    
                },
                ...
            },
            ...
        ]

We have to discuss (@myrmoteras ?) whether we want to (and can) split the current "treatmentTitle" into "treatmentTaxon" and "treatmentAuthority", i.e.:

     "treatmentTitle": "Maratus felinus Schubert, 2019, sp. nov."
into
      "treatmentTaxon": "Maratus felinus", 
      "treatmentAuthority": "Schubert, 2019, sp. nov.", 

or we can simply change the design of a treatment in the list of results.

A general question with regard to the dashboards for the treatments: how do you plan to return them, as a separate endpoint (i.e. /v2/treatmentsDashboards) or as a part of the "/v2/treatments" endpoint?

... more thoughts after Biodiversity_Next ...

punkish commented 5 years ago

I saw you have pagesize and pagenum, but I didn't see any "sortBy" options.

sortBy is something that is a client-side requirement. The API will provide the results sorted by the primary key. But since every client can have different requirements, and since it is trivial to do a JavaScript sort, that is best done by the client.

For all facet groups we will need a list of values with the respective count, i.e.:

    facets: {
        journalTitles: {
            displayName: "Journal Titles",
            total: 524,
            data: [
                {
                    displayName: "Zookeys",
                    total: 10
                },
                {
                    displayName: "Zootaxa",
                    total: 16
                },
                ---
            ]
        },
        ---
    }

I investigated providing counts. There are two issues here: One, looking at several implementations (Amazon comes to mind), facets don't show any counts. Two, I tried doing counts, but the performance is really bad (other than for the first run, which can be cached).

Interestingly, if you look at https://zenodo.org, it does provide facets with counts, but the counts are really misleading. As you click on the facets, the counts don't change. So, perhaps they are facing the same issue that I am facing – the first time the counts are probably cached so easily provided. But with every click, the result set becomes smaller and yet, the facet counts don't change. That gets really confusing for the user.

My suggestion: try using just the facets, without counts. With every click on a facet, a new result set will be fetched because the entire result set is bigger than just the pageSize worth of results that are displayed.

In the current design we have "Article Author". @mguidoti could you please check if that is correct? I guess it should be "authorityName"?

I am not sure what the above means. Can you clarify?

Here are the missing facet groups for the treatments:

            relatedMaterialCitations: {},
            relatedTreatmentCitations: {}, 
            hasFigures: {}, 
            collectionCodes: {}, 

The above are not fields in the treatments table. Please look at the treatments document that @tcatapano made. If the above are required as facets, I have to figure out how to provide them, if at all possible. For example, if you want the count of relatedMaterialCitations, we run into the same problem as I described above regarding counts.

Here are the missing fields for the treatment record:

records: [
            {

                figuresCnt: 10,
                materialsCnt: 10,
                externalLinks: {                    
                    plazi: {href:"", name:"Plazi"},
                    zenodo: {href:"", name:"Zenodo"},
                    gbif: {href:"", name:"GBIF"},                    
                },
                ...
            },
            ...
        ]

I will check if the above fields exist in the treatments table as is or if they have to be created. Will get back to you soon. Also, if the above fields can be returned, they have to be added to the specs document that @tcatapano made.

We have to discuss (@myrmoteras ?) whether we want to (and can) split the current "treatmentTitle" into "treatmentTaxon" and "treatmentAuthority", i.e.:

     "treatmentTitle": "Maratus felinus Schubert, 2019, sp. nov."
into
      "treatmentTaxon": "Maratus felinus", 
      "treatmentAuthority": "Schubert, 2019, sp. nov.", 

or we can simply change the design of a treatment in the list of results.

A general question with regard to the dashboards for the treatments: how do you plan to return them, as a separate endpoint (i.e. /v2/treatmentsDashboards) or as a part of the "/v2/treatments" endpoint?

As I explained in an earlier post (have to find the reference), think of the dashboards as a summary of the current result set (the result of any query). These summaries will be provided as part of the treatments endpoint. There is no resource called treatmentsDashboards, so that can't be an endpoint. An endpoint can only be a legitimate resource, and for now these are treatments, materialsCitations, figureCitations, bibRefCitations, treatmentCitations, and treatmentAuthors.

punkish commented 5 years ago

hola @teodorgeorgiev, I have just pushed some improvements to Zenodeo. Please check out the facets being returned now. For example, https://zenodeo.punkish.org/v2/treatments returns the following (only part of the output shown below)

{
  "value": {
    "num-of-records": 308587,
    "search-criteria": {
      "page": "1",
      "size": "30",
      "limit": 30,
      "offset": 0
    },
    "_links": {
      "self": {
        "href": "https://zenodeo.punkish.org/v2/treatments?page=1&size=30"
      }
    },
    "facets": {
      "journalTitle": [
        {
          "journalTitle": "& al. • Phylogeny of Iresine and pollen evolution (Amaranthaceae)",
          "c": 36
        },
        {
          "journalTitle": "1",
          "c": 1
        },
        {
          "journalTitle": "AMERICAN MUSEUM NOVITATES",
          "c": 2
        },
        {
          "journalTitle": "AMERICAN MUSEUM Novitates",
          "c": 7
        },
        {
          "journalTitle": "Abhandlungen herausgegeben von der Senckenbergischen Naturforschenden Gesellschaft",
          "c": 1
        },
        {
          "journalTitle": "Abhandlungen und Berichte des Naturkundemuseums Görlitz",
          "c": 3
        },
        {
          "journalTitle": "Acarologia",
          "c": 4
        },
        {
          "journalTitle": "Acarology",
          "c": 4
        },
        {
          "journalTitle": "Acta Arachnologica",
          "c": 78
        },
        {
          "journalTitle": "Acta Arachnologica Sinica",
          "c": 2
        },
        {
          "journalTitle": "Acta Biol., Venez",
          "c": 70
        },
        {
          "journalTitle": "Acta Entomologica Musei Nationalis Pragae",
          "c": 6
        },

The performance is still not up to what I would call satisfactory, but the cached values are returned instantly, of course. I am going to continue to chip away to make this better.
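The excerpt above lists "AMERICAN MUSEUM NOVITATES" and "AMERICAN MUSEUM Novitates" as separate facet values. As a rough client-side sketch, assuming such entries really are the same journal and differ only in letter case or surrounding whitespace, a facet array in the `{ journalTitle, c }` shape could be merged before display. The function name `mergeFacetValues` is hypothetical and not part of Zenodeo.

```javascript
// Merge facet rows whose labels differ only by case or surrounding whitespace.
// `facet` is an array such as [{ journalTitle: "Acarologia", c: 4 }, ...];
// `key` is the name of the label property ("journalTitle" in the excerpt above).
function mergeFacetValues(facet, key) {
    const merged = new Map();
    for (const row of facet) {
        const label = String(row[key]).trim();
        const id = label.toLowerCase();
        const seen = merged.get(id);
        if (seen) {
            seen.c += row.c; // same label in another spelling: add the counts
        } else {
            merged.set(id, { [key]: label, c: row.c }); // keep the first spelling seen
        }
    }
    return [...merged.values()];
}

const sample = [
    { journalTitle: 'AMERICAN MUSEUM NOVITATES', c: 2 },
    { journalTitle: 'AMERICAN MUSEUM Novitates', c: 7 },
    { journalTitle: 'Acarologia', c: 4 },
];
const cleaned = mergeFacetValues(sample, 'journalTitle');
```

This only papers over the problem in the client; entries such as "1" would still need to be fixed in the source data.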

cc @myrmoteras

howkins commented 4 years ago

Hi @punkish, please see below our comments regarding the treatments endpoint - Teodor


Hi! I am Georgi from the Pensoft team. I saw the changes to the facets in the treatments endpoint and I think they are good, except for these missing resources:

    species: [
        {
            species: :string
            c: integer
        },
    ],
    journalVolume: [
        {
            journalVolume: :string
            c: integer
        },
    ],
    relatedMaterialCitations: {
        yes: integer, // count
        no: integer // count
    },
    relatedTreatmentCitations: {
        yes: integer, // count
        no: integer // count
    },
    hasFigures: {
        yes: integer, // count
        no: integer // count
    },
    collectionCodes: {
        yes: integer, // count
        no: integer // count
    },

I saw you have pagesize and pagenum, but I didn't see any "sortBy" options.

sortBy is something that is a client-side requirement. The API will provide the results sorted by the primary key. But since every client can have different requirements, and since it is trivial to do a JavaScript sort, that is best done by the client.

We expect the sortBy request option to sort all records. We cannot sort on the client because the set of results it receives is just one chunk of the whole set.
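To make the point concrete, here is a small sketch with made-up records (only the `treatmentId` and `journalYear` field names come from this thread). Sorting one page on the client gives a different "first page" than sorting the whole set on the server and then paging.

```javascript
// Hypothetical records, listed in primary-key order as the API would return them.
const allTreatments = [
    { treatmentId: 'a', journalYear: 2019 },
    { treatmentId: 'b', journalYear: 2005 },
    { treatmentId: 'c', journalYear: 2012 },
    { treatmentId: 'd', journalYear: 1998 },
];

const byYear = (x, y) => x.journalYear - y.journalYear;
const pageOf = (rows, page, size) => rows.slice((page - 1) * size, page * size);
const ids = (rows) => rows.map((row) => row.treatmentId).join(',');

// Client-side: the server pages by primary key, then the client sorts that one page.
const clientSorted = pageOf(allTreatments, 1, 2).slice().sort(byYear);

// Server-side: the whole set is sorted first, then paged.
const serverSorted = pageOf(allTreatments.slice().sort(byYear), 1, 2);

console.log(ids(clientSorted)); // b,a
console.log(ids(serverSorted)); // d,b
```

The oldest treatment ("d") never reaches the client in the first case, so no amount of client-side sorting can put it on page one.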

sortBy: [oneOf] ASC|DESC

The requirements can be seen here

punkish commented 4 years ago

I have been testing various facets and I really don't think they make much sense as is. For example, I added species to the mix and got almost 90,000 rows, many of them with really janky data. Tried journalVolume and got similar results… almost 4000 rows and meaningless numbers for volumes (journal volumes are, after all, just numbers – is it really meaningful to say that '38' occurs '72' times?). In any case, the biggest problem is the size of the result. When no params are provided, the default result set is almost 5 MB in size. You really don't want to be making users download 5 MB of data just to be able to populate their search widget. This has to be really rethought or scaled down in its ambitions.

Then there is the issue of relatedMaterialCitations, relatedTreatmentCitations, hasFigures, collectionCodes. These are not columns in the treatments table. I can get the numbers via joins, but they are not similar to the other facets. Even in terms of their structure in the JSON depicted above, they are just objects with 'yes' and 'no' values while the other facets are arrays of objects. Mixing data types for something that should be logically similar doesn't feel right.
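One possible way around the mixed shapes, offered only as a sketch and not as what the API does: a yes/no facet could be expressed in the same array-of-objects form as the other facets. The function name `booleanFacetToRows` is hypothetical.

```javascript
// Convert { yes: n, no: m } into the [{ <name>: value, c: count }] shape
// used by facets such as journalTitle.
function booleanFacetToRows(name, counts) {
    return [
        { [name]: 'yes', c: counts.yes },
        { [name]: 'no', c: counts.no },
    ];
}

// Example with made-up counts.
const hasFiguresRows = booleanFacetToRows('hasFigures', { yes: 3, no: 7 });
```

With this shape a client can render every facet with the same loop, at the cost of a slightly more verbose payload for the two-valued facets.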

punkish commented 4 years ago

Let's rethink this facets business. For starters, let's say you go to the website and hit search with no params provided. Think of this query as

SELECT Count(*) AS c FROM treatments;

The answer comes back, "There are 250000 treatments" and perhaps the first 30 treatments are shown. Note that the "first 30" is dependent on the sort order. But since the sort order is not provided, (kinda pointless when one is viewing only 30 records), the default sort order is the primary key.

Facets should allow you to narrow the result. But the facets themselves should not be overwhelming. For example, if all 250K treatments came from five journals, you could provide the names of those five journals and the number of treatments from each. Clicking on any one of those journals would give you the number of treatments from that journal. The effective SQL query would be

SELECT Count(*) AS c FROM treatments WHERE journal = ?;

Now imagine that instead of 5, all those 250K treatments came from 3000 different journals. There is no way you would provide a list of all those 3000 journals so the user could narrow the records. The web page would be a mess.

So, rethink the facets and use only those that result in a small number of distinct values.
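As a rough illustration of that rule, the sketch below keeps only the facets whose number of distinct values is at or below a limit. The function name `usableFacets` and the default limit of 20 are assumptions chosen for the example, not values from the API.

```javascript
// Keep only facets with a manageable number of distinct values.
// `facets` maps a facet name to an array of { <name>: value, c: count } rows.
function usableFacets(facets, maxDistinct = 20) {
    return Object.fromEntries(
        Object.entries(facets).filter(([, rows]) => rows.length <= maxDistinct)
    );
}

// Example with made-up data: 3 distinct years, 50 distinct species.
const exampleFacets = {
    journalYear: [2005, 2012, 2019].map((year) => ({ journalYear: year, c: 1 })),
    species: Array.from({ length: 50 }, (_, i) => ({ species: `sp${i}`, c: 1 })),
};
const shown = usableFacets(exampleFacets, 10);
```

Here only `journalYear` survives; `species`, with 50 distinct values, is dropped from the search widget.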

howkins commented 4 years ago

This is the link to our test website: http://blr.uplaysandbox.website/ You can play with it and see what is available so far.

punkish commented 4 years ago

This is the link to our test website: http://blr.uplaysandbox.website/ You can play with it and see what is available so far.

I like it 👍 I am working on enhancements to the API and will update you soon

punkish commented 4 years ago

Hi @punkish, please see below our comments regarding the treatments endpoint - Teodor

We expect the sortBy request option to sort all records. We cannot sort on the client because the set of results it receives is just one chunk of the whole set.

sortBy: [oneOf] ASC|DESC

  • treatmentAuthors
  • journalYear
  • materialsCitations
  • figureCitations
  • treatmentCitations?

The requirements can be seen here

Wanted to let you know that I've got sorting working now although I haven't yet pushed the changes to the public API. Am still testing it. Hope to push it up by this weekend. In advance though, please see the following notes:

The following columns are not part of the 'treatments' table. They are related records for every treatment

I also don't have 'treatmentCitations' in my table. That leaves only 'journalYear' from what you asked for.

Remember, I can only sort by the columns in my table. The columns are

Though sorting by many of the above may not make sense. Note that the default sort is by 'treatmentId' with sort order ASC. The syntax is

?sortBy=<column:DIR>

// for example

?sortBy=journalYear:ASC
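A small sketch of building that parameter on the client. The helper name `buildSortBy` is hypothetical, and the default list of sortable columns contains only the two column names mentioned in this comment ('treatmentId' and 'journalYear'); the full list is not reproduced here.

```javascript
// Build a "sortBy=<column>:<DIR>" query fragment, rejecting anything
// that is not a known column or a valid direction.
function buildSortBy(column, dir = 'ASC', sortableColumns = ['treatmentId', 'journalYear']) {
    const direction = String(dir).toUpperCase();
    if (!sortableColumns.includes(column)) {
        throw new RangeError(`Cannot sort by "${column}"`);
    }
    if (direction !== 'ASC' && direction !== 'DESC') {
        throw new RangeError(`Sort direction must be ASC or DESC, got "${dir}"`);
    }
    return `sortBy=${column}:${direction}`;
}

console.log(buildSortBy('journalYear'));         // sortBy=journalYear:ASC
console.log(buildSortBy('journalYear', 'desc')); // sortBy=journalYear:DESC
```

Validating on the client is optional; it just surfaces a typo before the request is sent.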

More before this week ends.

punkish commented 4 years ago

hello @howkins and @teodorgeorgiev

apologies for the delay in delivering this API, but I've been busy with testing it and trying to make it fast enough to be usable. I am pushing a working version now but I want you to be aware of a breaking change that is easily fixed.

Now the treatments end-point (and eventually all the end-points) will not return facets and stats automatically. Instead, you will have to explicitly ask for them like so

/treatments?q=maratus&journalYear=2005&facets=true&stats=true

In other words, you have to append facets=true and/or stats=true for the API to return the respective data. This is because the same end-point can be used for other purposes where facets or stats may not be needed. And, actually, facets are really burdensome both for querying and for sending back. On a default query (with no query params), the facets, as you guys want them, add almost 4.5 MB to the data. This is really inefficient. But I am not going to have a specific API for just facets (API end-points are only for nouns: resources such as images, treatments, publications, etc.), so facets and stats have to be requested explicitly on the resource end-points.

So, with this caveat, I am pushing the changes now. Your BLR application will break because you are not asking for facets explicitly. Just add facets=true to the query string and all will be fine.
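For illustration, a sketch of how a client might build such a request with the standard `URL` class. The function name `treatmentsUrl` is hypothetical; the base URL is the one used elsewhere in this thread.

```javascript
// Build a treatments query URL; facets and stats are opt-in.
function treatmentsUrl(params = {}, { facets = false, stats = false } = {}) {
    const url = new URL('https://zenodeo.punkish.org/v2/treatments');
    for (const [key, value] of Object.entries(params)) {
        url.searchParams.set(key, String(value));
    }
    if (facets) url.searchParams.set('facets', 'true');
    if (stats) url.searchParams.set('stats', 'true');
    return url.toString();
}

console.log(treatmentsUrl({ q: 'maratus', journalYear: 2005 }, { facets: true, stats: true }));
// https://zenodeo.punkish.org/v2/treatments?q=maratus&journalYear=2005&facets=true&stats=true
```

Calling `treatmentsUrl({ q: 'maratus' })` without the second argument yields a URL with neither flag, which matches the new default behaviour.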

Many thanks for your patience as I have worked on this.

punkish commented 4 years ago

hello @howkins and @teodorgeorgiev,

I have pushed updates to the queries so that they are much faster. I have tried a few queries and there is no timeout happening anymore. But, please do check, and if you face a problem, please open an issue immediately. That is the only way I can solve things. I am hoping the query responses will be in the sub-second range, even with all the facets and stats, but that is an ambitious goal. Hopefully we can get there.

Many thanks

howkins commented 4 years ago

Hello @punkish, I have tried a wide range of queries and found that performance is much better, but sometimes I receive status code 504 Gateway Time-out for queries that I had executed successfully before.

punkish commented 4 years ago

Thanks for the report. Could you please make a note of the queries that are giving a 504 and let me know? Basically, the DB is taking too long to run those queries and the web server is timing out. In fact, if you run the exact query again, it will come back immediately because, even though it timed out the first time, in the meantime the DB finished the query and the result was cached.

In any case, if you let me know the slow queries, I can check the specific indexes and try to speed them up.
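Until the slow queries are fixed, a client could lean on that caching behaviour by retrying once or twice after a 504. This is only a sketch: the function name `fetchWithRetry`, the retry count, and the delay are assumptions, and `fetchImpl` is injectable so the logic can be tried without a network.

```javascript
// Retry a request when the server answers 504, on the assumption that the
// database finishes and caches the query in the meantime.
async function fetchWithRetry(url, { retries = 2, delayMs = 2000, fetchImpl = fetch } = {}) {
    for (let attempt = 0; ; attempt += 1) {
        const response = await fetchImpl(url);
        // Return anything that is not a 504, or the last 504 once retries are used up.
        if (response.status !== 504 || attempt >= retries) return response;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
}

// Usage (not executed here):
// const res = await fetchWithRetry(
//     'https://zenodeo.punkish.org/v2/treatments?facets=true&stats=true&q=temnothorax');
```

Retrying hides the slow query from the user, so it is still worth logging the URL of every 504 for the report requested above.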

On Jan 23, 2020, at 1:43 PM, Georgi Zhelezov notifications@github.com wrote:

 Hello @punkish, I have tried a wide range of queries and found that performance is much better, but sometimes I receive status code 504 Gateway Time-out for queries that I had executed successfully before.


howkins commented 4 years ago

OK, as soon as I receive this response again, I will send you a report with an example.

I noticed another issue. When I open Query1, for example, I see journalVolume as a search option in the facets, but when I try to use it (Query2) I receive an object with statusCode: 400.

Query1: https://zenodeo.punkish.org/v2/treatments?facets=true&stats=true&q=temnothorax

Query2: https://zenodeo.punkish.org/v2/treatments?facets=true&stats=true&journalVolume=11&q=temnothorax

punkish commented 4 years ago

thanks for reporting. This is because journalVolume was not a queryable column. I’ve now added it to the schema and it should work now. To see what columns are queryable, please see the docs at https://zenodeo.punkish.org/docs

On Jan 23, 2020, at 6:59 PM, Georgi Zhelezov notifications@github.com wrote:

OK, as soon as I receive this response again, I will send you a report with an example.

I noticed another issue. When I open Query1, for example, I see journalVolume as a search option in the facets, but when I try to use it (Query2) I receive an object with statusCode: 400.

Query1: https://zenodeo.punkish.org/v2/treatments?facets=true&stats=true&q=temnothorax

Query2: https://zenodeo.punkish.org/v2/treatments?facets=true&stats=true&journalVolume=11&q=temnothorax
