myrmoteras closed this issue 2 years ago
Thanks @millerjeremya for the very nice bug report. (Please note that an Ocellus-related issue should be on the ocellus repo – thanks @myrmoteras for moving the report here.)
First, let's address the issue of repeating images… I am aware of it. The reason is almost always that the image is actually a composite image (several plates together). Each plate has its own figure caption, and hence each is assigned a different `figureCitationId`, so they appear as separate records. I have been working (struggling a bit) to resolve this for the long term, and I think I am making good progress. But since I have to change the backend as well, and also (perhaps) remove the duplicates after doing the query while somehow retaining the information that there are separate captions, this is taking a bit of time. Hopefully it will be fixed in an update to Zenodeo 3, and hence in Ocellus.
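To illustrate the idea of removing duplicates while keeping the separate captions, here is a minimal sketch that groups records by the image they point to. The field names `imageUrl` and `caption`, and the sample records, are assumptions made up for this example; they are not Zenodeo's actual schema.

```python
def group_by_image(records):
    """Collapse records that share an image, keeping every caption.

    Each record is assumed (hypothetically) to carry an 'imageUrl',
    a 'figureCitationId' and a 'caption'.
    """
    grouped = {}  # dicts keep insertion order, so result order is stable
    for rec in records:
        entry = grouped.setdefault(
            rec["imageUrl"], {"imageUrl": rec["imageUrl"], "captions": []}
        )
        entry["captions"].append(
            {"figureCitationId": rec["figureCitationId"], "caption": rec["caption"]}
        )
    return list(grouped.values())


# Made-up sample: one composite image cited by three plate captions.
sample = [
    {"imageUrl": "https://example.org/composite.png", "figureCitationId": "fc-1", "caption": "Plate A"},
    {"imageUrl": "https://example.org/composite.png", "figureCitationId": "fc-2", "caption": "Plate B"},
    {"imageUrl": "https://example.org/other.png", "figureCitationId": "fc-3", "caption": "Habitus"},
    {"imageUrl": "https://example.org/composite.png", "figureCitationId": "fc-4", "caption": "Plate C"},
]
images = group_by_image(sample)
```

With this shape, the viewer would show each image once, and the captions would still be available for display alongside it.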
Now, back to the issue of tables as images. The ability to extract tables as entities is fairly recent in TB. Ocellus is powered by Zenodeo, and Zenodeo gets its data from TB. I am now focusing on Zenodeo 3, and implementing the capability to update the db every night. Whatever changes are made available by TB, they appear in Zenodeo as well, and hence, in Ocellus. Unfortunately, this is a different data pipeline from TB→Zenodo. But, hopefully this will all be resolved once everything is working and reasonably up-to-date.
One thing I have thought about, keeping in mind the long-term nature of the work, is whether I should eventually abandon my current approach, wherein Zenodeo (my API) ingests the data from TB. Importing data from TB gives me the freedom to work with the data however I want, and to use various database capabilities that become available to me. But it increases my work and my responsibility to keep the data in sync. An alternative would be to redo the API so it queries Zenodo directly, acting as a bridge/filter between the user and Zenodo's (in my view, more idiosyncratic) Elasticsearch syntax. It would decrease the flexibility I have with the data, but it would eliminate an extra step, thereby ensuring that I always have the data as fresh as it is on Zenodo.
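As a rough sketch of what the "bridge" alternative could look like, the function below only translates a simple user query into a Zenodo search URL; it does not send any request. The endpoint and the parameter names (`q`, `page`, `size`, `communities`) are assumptions modelled on the parameters visible in Ocellus URLs, and `build_zenodo_url` is a hypothetical helper, not part of Zenodeo.

```python
from urllib.parse import urlencode

# Assumed endpoint for Zenodo's record search; check Zenodo's API docs before relying on it.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def build_zenodo_url(text, page=1, size=30, community="biosyslit"):
    """Build (but do not fetch) a search URL for a plain full-text query."""
    params = {"q": text, "page": page, "size": size, "communities": community}
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


url = build_zenodo_url("Tyrannosaurus rex")
```

The trade-off described above shows up here: the bridge stays thin and always reflects Zenodo's current data, but anything beyond what the upstream query syntax supports would have to be done after the results come back.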
I hope I have clarified your doubts even though I don't have an immediate solution. I will keep this issue open, and when I have a working solution, I will update this issue so you are informed, and only then I will close it.
Thanks for your clear description of where things stand currently and some possible futures. Looking forward to updates and further discussion.
hi @millerjeremya
I have just now pushed a brand new, completely rewritten version of Ocellus. There are no more image duplicates, and images from Zenodo and treatments are now combined for richer results.
In addition, a whole slew of new ways to query is now possible. I have started providing some examples, though the detailed examples can be seen on the Zenodeo website (all of this is made possible by Zenodeo's brand new querying capabilities).
Please try out the application and let me know what you think. I look forward to your feedback.
Thanks,
Puneet
cc @myrmoteras
@millerjeremya if you want to understand more, have a look at https://test.zenodeo.org/, which explains what Zenodeo does.
in the example searches this should show up
@punkish when I follow this example, the T just shows a cursor (hand in my terminology) but nothing is happening
I'm exploring the new version. Looking good! I like the interactive "T" feature. When the identifier appears, it does seem to want to be clicked on. It looks like it doesn't currently have functionality, but the link to the treatment below is fine. If you intended that identifier to be clickable, I'm in Chrome on Windows 10.
One thing about the functionality that I'm not clear on is where it is conducting its search: in the figure caption? In the citing treatment? For these purposes, I'm using the example search provided: tyrannosaurus&authorityName=Osborn
I do miss the image count from the previous version. Is there currently any difference between the "all images" and "images on Zenodo" settings?
hola @millerjeremya,
thanks once again for the report… very helpful. Please see below for specific answers to your queries
> I like the interactive "T" feature. When the identifier appears, it does seem to want to be clicked on. It looks like it doesn't currently have functionality, but the link to the treatment below is fine. If you intended that identifier to be clickable, I'm in Chrome on Windows 10.
For starters, the 'T' on the image signifies that the image is related to a treatment. Clicking on it should indeed reveal the treatmentId (for now). I believe @myrmoteras also reported that nothing was happening when he clicked on the 'T' but I was not able to reproduce it. Thanks for telling me your browser and OS info (very useful). I will try to reproduce the problem and apply a fix. Nevertheless, even if it doesn't reveal the treatmentId, the 'T' is still useful just as a small visual indicator that the image is from a treatment.
> One thing about the functionality that I'm not clear on is where is it conducting its search - in the figure caption? in the citing treatment? For these purposes, I'm using the example search provided: tyrannosaurus&authorityName=Osborn
At its most basic, searching for any word or phrase results in a search against the entire text of the records. So, in the above query, all the records are found where 'tyrannosaurus' appears anywhere in the entire text.
Then, if there is a '=' sign in the query, the records are filtered further based on the `key=value` pair: in the above case, to those where the "authorityName" starts with "Osborn". You can read more about this extended query syntax at https://test.zenodeo.org/query-help
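The two-stage behaviour described above (full-text match first, then key=value filtering) can be sketched roughly as follows. This is only an illustration of the idea, not how Zenodeo implements it: `parse_query` and `matches` are hypothetical helpers, the `fulltext` field is an assumed name, and "starts with" is applied to every key here although the source only states it for `authorityName`.

```python
def parse_query(query):
    """Split an Ocellus-style query into a full-text term and key=value filters."""
    text_terms = []
    filters = {}
    for part in query.split("&"):
        if not part:
            continue
        if "=" in part:
            key, value = part.split("=", 1)
            filters[key] = value
        else:
            text_terms.append(part)
    return " ".join(text_terms), filters


def matches(record, text, filters):
    """True if the text occurs in the record and every filter value is a prefix match."""
    if text.lower() not in record["fulltext"].lower():
        return False
    return all(str(record.get(k, "")).startswith(v) for k, v in filters.items())


text, filters = parse_query("tyrannosaurus&authorityName=Osborn")

# Made-up records for illustration only.
hit = {"fulltext": "Tyrannosaurus rex, type specimen", "authorityName": "Osborn"}
miss = {"fulltext": "Tyrannosaurus rex, type specimen", "authorityName": "Smith"}
```

Here `matches(hit, text, filters)` passes both stages, while `miss` passes the full-text stage but is dropped by the `authorityName` filter.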
Keep in mind, Ocellus is meant to be very simple to operate/query. Zenodeo, on the other hand, the API that powers Ocellus, can do much more granular searches. This trick of passing query parameters as `key=value` pairs in Ocellus is just an added bonus. More on this below.
> Is there currently any difference between "all images" and "images on Zenodo" settings?
Yes, of course. "All images" includes images that were deposited on Zenodo even though they had nothing to do with treatments, as well as images that are related to treatments. As the terms imply, "images from Zenodo" and "images from treatments" are just the respective subsets of "all images".
The actual search for "images from Zenodo" is performed on Zenodo, and the search for "images from treatments" is performed on Zenodeo (on a database that I have created using data from TreatmentBank). The two results are then combined and duplicates are removed before they are presented to the user.
Because I have more control on the searches that I can do on my database, for now at least, whenever the user does a search with the extended syntax (as you did above), the search is only against the "images from treatments". In the long run, I would like to enable extended syntax search on Zenodo as well.
So, to summarize: if you just search for a text string (the simplest search), a full-text search is performed and records are fetched from both treatment-related images and other images. On the other hand, if you add extra parameters to the query, then the search is only against the treatment-related images.
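The routing rule summarized above could be sketched like this. The source labels `"zenodo"` and `"zenodeo"`, the helper names, and the use of an `imageUrl` field as the de-duplication key are all assumptions for the sake of the example.

```python
def choose_sources(query):
    """Pick which backends to search for a query.

    Extended syntax (any key=value pair) goes to the treatment database only;
    a plain text search goes to both sources.
    """
    if "=" in query:
        return ["zenodeo"]
    return ["zenodo", "zenodeo"]


def combine(zenodo_hits, zenodeo_hits, key="imageUrl"):
    """Merge hits from both sources, dropping repeats of the same image."""
    seen = set()
    combined = []
    for hit in zenodo_hits + zenodeo_hits:
        if hit[key] not in seen:
            seen.add(hit[key])
            combined.append(hit)
    return combined


simple = choose_sources("tyrannosaurus")
extended = choose_sources("tyrannosaurus&authorityName=Osborn")

# Made-up hits: 'b.png' is returned by both sources.
merged = combine(
    [{"imageUrl": "a.png"}, {"imageUrl": "b.png"}],
    [{"imageUrl": "b.png"}, {"imageUrl": "c.png"}],
)
```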
> I do miss the image count from the previous version.

Yes, I intend to bring that back. Currently I am having difficulty figuring out how to calculate and display the image counts given that there are two sources of images.
Please let me know if you have more questions.
@punkish when you write
> All images include images that were deposited on Zenodo

does this mean all of Zenodo, or the Biosyslit community (e.g. our BLR)?
yes, of course, all images of BLR community
to expound a bit more on @millerjeremya's question on search, a full-text search is performed on literally the full text of the records (the treatments or the text associated with the non-treatment images). In the case of treatments, while most of the images are stored on Zenodo, some treatments, especially the most recent ones, may not have been deposited on Zenodo yet (but are already a part of my database). Their images are retrieved from the original source, for example, from the Pensoft website.
Also, a treatment can have many images. Or rather, it can have many references to the same image. As `figureCitation` entries they are all different, but visually it is the same image, and hence they look like duplicates. Since Ocellus is a simple image viewer, the duplicates are removed. This is why I am still struggling a bit with displaying the image count… a search for "foo" might find 267 treatments which may have 328 images which, when de-duplicated, might go down to 299. To make matters even more complicated, I only retrieve 30 records at a time, but 30 records each from Zenodo and Zenodeo adds up to 60, and those 60 records may have 78 images that, de-duplicated, might be 67. I hope to find a reasonable (sensible as well as visually pleasing) solution soon.
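To make the counting problem concrete, here is a small sketch that reports the three different numbers for a set of fetched treatments: how many treatments, how many image references, and how many distinct images. The record layout (an `images` list of URLs per treatment) and the sample data are made up for illustration.

```python
def summarize(treatments):
    """Count treatments, image references, and distinct images in one batch."""
    image_refs = [url for t in treatments for url in t["images"]]
    return {
        "treatments": len(treatments),
        "imageReferences": len(image_refs),
        "uniqueImages": len(set(image_refs)),
    }


# Made-up page of results: T1 cites the same image twice, T1 and T2 share one.
page = [
    {"id": "T1", "images": ["a.png", "a.png", "b.png"]},
    {"id": "T2", "images": ["b.png", "c.png"]},
    {"id": "T3", "images": []},
]
counts = summarize(page)
```

Note that these numbers only describe the batch that was fetched. A total across all pages would need either the full result set or a de-duplicated count computed on the server, which is part of why the display is awkward when two sources are paged independently.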
see #27
from @millerjeremya
I am looking at the images found on Ocellus with a query for Tyrannosaurus rex: https://ocellus.info/images.html?q=Tyrannosaurus%20rex&size=30&page=1&communities=biosyslit
I notice a couple of issues that I would like to understand better.
First, I noticed some tables represented as figures. These appear to be legacy documents, arising because of historical problems with table markup that have since been resolved. The table is no longer marked as an image in the source document:
http://tb.plazi.org/GgServer/summary/FFC0FFA05A3EFFF8D97CF9081F50FFEB
http://treatment.plazi.org/id/03F987D8-5A3E-FFFB-D9E6-FC051AF4FC10
but it is still an image on BLR:
https://zenodo.org/record/3551688#.YfkckerMJPY
I have also noticed that some images are repeated. For example, the following images all appear on the second page of search results but have the same source and DOI:
https://zenodo.org/record/3961061#.Yfk2IerMJPY
https://zenodo.org/record/3961049#.Yfk2HOrMJPY
https://zenodo.org/record/3739922#.Yfk2DurMJPY
https://zenodo.org/record/3961073#.Yfk2A-rMJPY