gigascience / gigadb-website

Source code for running GigaDB
http://gigadb.org
GNU General Public License v3.0

Link previews E597 #428

Open only1chunts opened 4 years ago

only1chunts commented 4 years ago

User Story

As a website user I want to see a preview of any link on the dataset pages So that I have an idea of what that link's website is about

Acceptance Criteria

Given I have three non-GigaDB links associated to a dataset When I navigate to the dataset pages Then I should see metadata from the metatags, Open Graph and twitter card (thumbnail, site summary,...) of the target links

Additional info

Is your feature request related to a problem? Please describe. Currently, any links to additional info get added to GigaDB as basic URL links with no descriptions. The addition of descriptions is covered in another ticket (#61). It would be very nice to have a "preview" of the website it's linking to, similar to what appears on Facebook, Twitter, etc. when you add a URL to a comment there.

Describe the solution you'd like On a dataset page (e.g. http://gigadb.org/dataset/100482) there are often "Additional information" links. In this example there are 4:

https://pypi.org/project/ANNOgesic/
http://annogesic.readthedocs.io/en/latest/subcommands.html
http://annogesic.readthedocs.io/en/latest/required.html
https://hub.docker.com/r/silasysh/annogesic/

It would be nice to arrange these into square preview panels that look something like this (NB: two of the additional links point to different pages in "Read the Docs", so I've added it only once here): link-previews-mockup

Maybe this https://metatags.io/ could be useful ?

Additional context Whatever solution is found here may also be useful for the GitHub preview widget ticket (#266). In the above example there is also a GitHub link, for example https://github.com/gigascience/gigadb-website, which looks like this using metatags.io: github-preview

This Story is part of Epic #597

kencho51 commented 4 years ago

Hi @pli888 and @rija ,

I have done some background research on how to implement a link preview, which requires either installing different tools and modules or using an API. But I am not sure these are the correct ways to implement this feature. Please have a look and advise.

rija commented 4 years ago

Hi @only1chunts

Reading the requirements raises these questions:

Being interested in the original state and wanting to show previews for all additional info links implies we will have to store the previews in GigaDB (and figure out how to store them), with a change to the schema so that each link can locate its preview.

Being interested in current state and wanting that for all additional info links will cause the dataset view page to be slow to load because the page will have to calculate the current preview for all the additional info links.

All use cases I've seen of link previews only load the previews when one particular link is becoming the focus of the user (by hovering over the link, or sharing the link)

So for us, a more efficient UX would be to have a "preview" button next to each link or enable preview upon hovering.

Preview on hovering will be slow if the preview has to be calculated in realtime.

only1chunts commented 4 years ago

Hi @rija ,

If storing the preview image is problematic or slows loading times too much, then we could forgo the images and just use Google search-results-style previews that have the link and a short description (see the example previews on https://metatags.io/).

If we store the information for a preview-link would there be any provision for updates of those details either at fixed time points or as an admin-user induced refresh?

rija commented 4 years ago

Thanks @only1chunts for the clarification.

Generating a preview image yields a different set of problems depending on whether we store the images or calculate them on the fly. And indeed, if they are stored, it would make sense to be able to update these details.

However from looking at metatags.io, I've just realised that Facebook and Twitter do not necessarily generate preview images. They rely on semantic markup in the HTML of the target web site to pull out the preview information including the preview image.

Would it work for you if we take a similar approach:

So that researchers can assess what the links are about without leaving the page, when a dataset page loads we fetch the preview metadata for each link (if semantic markup is missing, we fall back to the meta-title and meta-description tags so we get at least the title and description). The metadata will include the URL of a preview image if the target web site has defined the semantic markup for it. We then display everything according to the mockup you've supplied.

Because we don't generate any images on the fly, there's no huge hit on dataset page loading performance. Because we don't store anything, there's no additional tooling or infrastructure to devise and deploy.

The connections to the target web sites could still slightly increase the loading time of the dataset page, especially if there are many links, but we could alleviate this with server-side caching of the preview metadata if it becomes a problem.

only1chunts commented 4 years ago

Using the metatags would be fine, I think. With the use of metatags in mind, can we check that our datasets have all the appropriate metadata to enable others to generate nice previews of our pages? I.e. can we use the thumbnail image on each page to populate the relevant metatag field in the HTML code of the site?

rija commented 4 years ago

@only1chunts,

It should be feasible to add the image path of the main image (is that what you call thumbnail? ) in a set of semantic markups on each page. That would be a separate piece of work though.

What I was wondering is: what is the value of #61 now? Couldn't we just use the same link preview feature for external links too? Surely how a web site describes itself (and the meta-description should always be there, otherwise the web site would rank badly in Google search results) is more accurate than whatever description we'd add manually?

@pli888,

I think this task is not small but it shouldn't be too difficult (i.e. I don't see traps or dragons in there). The requirement is now clearer, the approach is clear, and there are good precedents and examples in GigaDB.

This is about fetching and parsing metadata from other websites into a data layer, caching it, and formatting it in the presentation layer. It's probably best that we divide the story into three tasks/tickets/PRs along the layer boundaries. It would then be easier to review constructively and merge, and it reduces the mental scope to something manageable for each task.

For example, first task could be to implement a StoredLinkPreview.php, second one to implement CachedLinkPreview.php, and third one to implement FormattedLinkPreview.php and the dataset/view.php changes.

@kencho51,

Great initiative! That blog post has indeed some good info as it alludes to why we can't do this on the frontend (in the browser) in general for arbitrary links (keywords here are CORS and WCP), and what metadata languages (HTML, OGP) we are dealing with.

The linkpreview API is not the right approach for a couple of reasons, chiefly because for each of our additional information links, our page will need to make one connection to their API, which in turn will make another connection to the target web site. That's a lot of connections per link. Not even counting the time to generate previews, this will slow down the page heavily. The dataset page is one of the most important pages of the site, so it needs to load as fast as possible. Additionally, we do not control that API: we don't know the quality of their code, the stability and performance of their operations, or whether they respect their data policy. For all we know, they could go out of business at any moment, get hacked, or be prone to bad-quality connections. To alleviate some of these risks, we'd probably need a professional-level subscription with guarantees, which means we would need to pay a regular fee.

If you go to https://packagist.org (which lists all PHP libraries available through Composer) and search for "meta-tags", "link preview", "microdata", "microformats", "schema" or "OGP", you will get a few results that may or may not help us reduce the amount of code we need to implement. We would need to evaluate them.

only1chunts commented 4 years ago

@rija to save confusion about which image I mean, I've put a red box around the image on an example page below: image For External Links (#61), having the same sort of preview but in a pop-up rather than a constant view should be fine. I just noticed that GitHub does this for ticket links!

image

I don't know if it makes any difference, but something to be aware of is that some external links get rendered into their own iframe tab.

rija commented 4 years ago

@only1chunts

Yes, that's the image I was thinking of as well. I was hesitant calling it thumbnail because it's not exactly thumb-sized and doesn't perform the function of a thumbnail image. We probably should find a better name for it at some point. Anyway, it's still a different user story from this ticket, feel free to file a new ticket, it's quite a small task.

Re #61, it seems that ticket should just be updated to de-emphasize adding a description and to mention a pop-up link preview (by the way, when I was writing about "hovering", that's the behaviour I was trying to describe). Re the iframe rendering, I don't know what that implies; let's just add that info to the ticket so it can be investigated whenever that ticket is about to be worked on.

only1chunts commented 4 years ago

I've split the part about making our pages preview nicely in external tools like facebook & google to a new ticket #513

rija commented 4 years ago

Thanks @only1chunts.

Now that #85 and #513 have been updated and created respectively, I'd recommend that @kencho51 work on them before this ticket: they are small, self-contained within the presentation layer, and a good introduction to linked data applications. By the time this ticket is ready to be worked on, the domain will already be familiar.

kencho51 commented 4 years ago

Hi @rija Ok I will work on #85 and #513 first.

rija commented 4 years ago

@only1chunts @kencho51

User Story and acceptance tests:

Feature: Adding preview information for the links listed on dataset page under "Additional information"
As a researcher
I want to see preview information of links to additional information when I visit a dataset page
So that I can assess what all the links are about without leaving the dataset page

Background:
Given there are "Additional information" links associated with datasets:
| dataset | url | comment |
| 100002 | https://pypi.org/project/ANNOgesic/ | has all metadata |
| 100002 | https://www.cell.com/ajhg/fulltext/S0002-9297(17)30074-5 | has all metadata |
| 100002 | https://metatags.io/ | has all metadata |
| 100002 | https://biocontainers.pro/#/registry | has title, description |
| 100002 | http://annogesic.readthedocs.io/en/latest/subcommands.html | has title |

Scenario: All links with title, description and image
Given I am not logged in
When I go to "/dataset/100002"
Then I should see all metadata for links under "Additional information"
| links |
| https://pypi.org/project/ANNOgesic/ |
| https://www.cell.com/ajhg/fulltext/S0002-9297(17)30074-5 |
| https://metatags.io/ |

Scenario: Show preview information for links with title and description only
Given I am not logged in
When I go to "/dataset/100002"
Then I should see the title for url "https://biocontainers.pro/#/registry" under "Additional information"
Then I should see the description for url "https://biocontainers.pro/#/registry" under "Additional information"

Scenario: Show preview information for links with title only
Given I am not logged in
When I go to "/dataset/100002"
Then I should see the title for url "http://annogesic.readthedocs.io/en/latest/subcommands.html" under "Additional information"

Implementation Notes

The implementation code will be organised using onion architecture patterns already used extensively in DatasetPageAssembly.php (used by DatasetController.php).

Here is an example from that class that you can emulate, as the flow and the nature of the source (an external web site) are similar:

    /**
     * Create a connections dataset component to be used in a dataset page
     *
     * @return DatasetPageAssembly
     */
    public function setDatasetConnections(): DatasetPageAssembly
    {
        // Layers are nested from the outside in: formatting, then caching, then storage
        $this->_connections = new FormattedDatasetConnections(
            $this->_app->getController(),
            new CachedDatasetConnections(
                $this->_app->getCache(),
                $this->_cacheDependency,
                new StoredDatasetConnections(
                    $this->_dataset->id,
                    $this->_app->getDb(),
                    new \GuzzleHttp\Client()
                )
            )
        );
        return $this;
    }

which is set up in DatasetController.php as such:


        // Assembling page components and page settings

        $assembly = DatasetPageAssembly::assemble($model, Yii::app(),$srv);
        $assembly->setDatasetSubmitter()
...
                    ->setDatasetConnections()
...

and is then used in dataset/view.php as such:

<div class="subsection">
    <div class="underline-title">
        <div>
            <h4>Additional details</h4>
        </div>
    </div>
    <?php
    $publications = $connections->getPublications();
    if (!empty($publications)) { ?>
        <h5><strong><?= Yii::t('app', 'Read the peer-reviewed publication(s):') ?></strong></h5>
        <p>
            <?php foreach ($publications as $publication) {
                echo $publication['citation'] . $publication['pmurl'];
                echo "<br/>";
            }
            ?>
        </p>
    <?php } ?>

The implementation can be divided into five parts:

rija commented 4 years ago

1. Implement protected/components/StoredDatasetLinksPreview.php

If a dataset has links of "Additional information" type, it will retrieve the meta-data associated with each of these links in order to construct a preview meta-data data structure.

There is no need to store that data structure in the database, for two reasons. First, the main method will be called by a decorator object's method of the same name, which first tries to read the data from the cache; on a cache miss, the decorator falls back to this class's method and stores the result in the PHP cache. Second (and this is the reason the class has a Stored prefix), the web or the cloud is already considered a form of (remote) persistent storage in this context.
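To illustrate the decorator flow described above, here is a minimal sketch. It is not the project's code: the interface, the in-memory ArrayCache and the FakeStoredLinksPreview class are all hypothetical stand-ins so the example runs without Yii, and the cache is assumed to return false on a miss.

```php
<?php
declare(strict_types=1);

// Hypothetical interface shared by the stored and cached layers.
interface LinksPreviewInterface
{
    public function getPreviewDataForLinks(): array;
}

// Hypothetical in-memory cache, standing in for the application cache.
// Assumption: get() returns false on a miss.
final class ArrayCache
{
    /** @var array */
    private $items = [];

    public function get(string $key)
    {
        return $this->items[$key] ?? false;
    }

    public function set(string $key, $value): void
    {
        $this->items[$key] = $value;
    }
}

// Stand-in for the stored layer; it counts calls so cache hits are observable.
final class FakeStoredLinksPreview implements LinksPreviewInterface
{
    /** @var int */
    public $calls = 0;

    public function getPreviewDataForLinks(): array
    {
        $this->calls++;
        return [['external_url' => 'http://foo6.com', 'external_title' => 'Example title']];
    }
}

// Decorator: same method name as the stored layer, cache first, fallback second.
final class CachedLinksPreviewSketch implements LinksPreviewInterface
{
    private $cache;
    private $stored;
    private $cacheKey;

    public function __construct(ArrayCache $cache, LinksPreviewInterface $stored, string $cacheKey)
    {
        $this->cache = $cache;
        $this->stored = $stored;
        $this->cacheKey = $cacheKey;
    }

    public function getPreviewDataForLinks(): array
    {
        $cached = $this->cache->get($this->cacheKey);
        if ($cached !== false) {
            return $cached; // cache hit: the stored layer is not called
        }
        $fresh = $this->stored->getPreviewDataForLinks();
        $this->cache->set($this->cacheKey, $fresh);
        return $fresh;
    }
}
```

Calling getPreviewDataForLinks() twice on the decorator should reach the stored layer only once. Cache expiry and cache dependencies are left out of this sketch.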

Calling the link's web site, then fetching and parsing its metadata, is quite common functionality, so you could use a Composer library that satisfies these criteria:

Note: If you want to implement the mechanism yourself, you would follow this algorithm:

  • Fetch the content at an arbitrary URL (should support http/https, follow URL redirections, and deal with HTTP errors)
  • Parse the title, description and image URL from the page by merging the available data from any of the supported metadata languages, with HTML meta-tags as a last-resort fallback
  • Make sure title and description are sanitised text
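The parsing and sanitising steps above can be sketched as follows. This is only an illustration under assumptions: parsePreviewMetadata() is a hypothetical helper name, the order of preference (Open Graph, then Twitter card, then plain HTML tags) is one reasonable choice rather than a stated requirement, and the fetching step is left out so the example needs no network access.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: extract title, description and image URL from HTML.
 * Prefers Open Graph tags, then Twitter card tags, then plain HTML tags.
 */
function parsePreviewMetadata(string $html): array
{
    $empty = ['title' => null, 'description' => null, 'imageUrl' => null];
    if (trim($html) === '') {
        return $empty; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid: collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Index <meta property="..."> and <meta name="..."> by lower-cased key; first one wins.
    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $key = strtolower($tag->getAttribute('property') ?: $tag->getAttribute('name'));
        $content = trim($tag->getAttribute('content'));
        if ($key !== '' && $content !== '' && !isset($meta[$key])) {
            $meta[$key] = $content;
        }
    }

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $htmlTitle = $titleNode !== null ? trim($titleNode->textContent) : null;

    // Return the first key that is present, or the fallback.
    $pick = static function (array $keys, ?string $fallback = null) use ($meta): ?string {
        foreach ($keys as $key) {
            if (isset($meta[$key])) {
                return $meta[$key];
            }
        }
        return $fallback;
    };

    // Sanitise to plain text: strip tags and collapse whitespace.
    $clean = static function (?string $text): ?string {
        if ($text === null) {
            return null;
        }
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($text)));
        return $text === '' ? null : $text;
    };

    return [
        'title' => $clean($pick(['og:title', 'twitter:title'], $htmlTitle)),
        'description' => $clean($pick(['og:description', 'twitter:description', 'description'])),
        'imageUrl' => $pick(['og:image', 'twitter:image']),
    ];
}
```

The HTML string would come from whatever HTTP client performs the fetch step. Character-encoding detection and validation of the image URL are not handled here and would need attention in a real implementation.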

The main public method in the class can be named getPreviewDataForLinks() and have the following signature:

public function getPreviewDataForLinks(): array

The returned array will be an array of associative arrays with these keys:

This array will be used in the next sub-task.

Have a look at StoredDatasetConnections.php's getPublications() method for inspiration, as the flow is similar. For the unit tests, StoredDatasetConnectionsTest.php's testStoredReturnsPublications() is likewise a good example. Notice how the HTTP client is mocked: you will need to mock the Composer library in the same way, as we don't want to use our resources to test a third-party codebase (which is why we want such libraries to have their own tests).

You can ignore the interface (implements DatasetConnectionsInterface) for the moment; that will be dealt with in the next sub-task.

pli888 commented 3 years ago

@rija

The implementation code will be organised using onion architecture patterns already used extensively in DatasetPageAssembly.php (used by DatasetController.php).

In the develop branch, DatasetPageAssembly class does not exist so @kencho51 would need to create it and then use it from DatasetController.php.

rija commented 3 years ago

@pli888, @kencho51

You're right, DatasetPageAssembly is not on develop

Here's the code for that class in my feature branch (I've highlighted the code for external link as an example): https://github.com/rija/gigadb-website/blob/1eb4a945658ad7eca9544ceb2e6aef9044782033/protected/components/DatasetPageAssembly.php#L195-L222

And here's a version of DatasetController making use of it (notice the highlighted code): https://github.com/rija/gigadb-website/blob/5241dc1f97cc15772b97af93707bdabb8dc398a4/protected/controllers/DatasetController.php#L88-L98

However, the changes to DatasetPageAssembly are not part of the first few tasks. So in theory, we don't need to deal with it until the fifth task, and the classes for the storage, cache and formatted layers are pretty much the same on develop and my branch.

That said, to avoid confusion, to increase familiarity and to ease the implementation, it's probably better to hold off this work until the PR #546 with my changes is reviewed and merged to develop.

kencho51 commented 3 years ago

Hi @rija, @pli888

2. Implement protected/components/CachedDatasetLinksPreview.php

CachedDatasetLinksPreview.php will cache the following information:

array(
    'short_doi'=>'100249',
    'external_url'=>'http://foo6.com',
    'type'=>'3D Models',
    'external_title'=>'Exercise generates immune cells in bone',
    'external_description'=>'Mechanosensing stem-cell niche promotes lymphocyte production.',
    'external_imageUrl'=>'https://media.nature.com/lw1024/magazine-assets/d41586-021-00419-y/d41586-021-00419-y_18880568.png',
)

3. Implement protected/components/FormattedDatasetLinksPreview.php

FormattedDatasetLinksPreview.php will format the cached contents as below:

array(
    'preview_title'=>'<a href="http://foo6.com">Exercise generates immune cells in bone</a>',
    'preview_description'=>'<p>Mechanosensing stem-cell niche promotes lymphocyte production.</p>',
    'preview_imageUrl'=>'<a href="http://foo6.com">'.'<img src="https://media.nature.com/lw1024/magazine-assets/d41586-021-00419-y/d41586-021-00419-y_18880568.png" alt="Go to site"/></a>',
)
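As a rough sketch of how one cached entry could be turned into the formatted array above: formatLinkPreview() is a hypothetical function name, the key names follow the two arrays shown in this comment, and the HTML escaping plus the omission of missing fields are assumptions added here rather than confirmed behaviour of FormattedDatasetLinksPreview.php.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical formatter: turns one cached preview entry into HTML fragments.
 * Assumption: text and URLs come from third-party sites, so they are escaped.
 */
function formatLinkPreview(array $entry): array
{
    $esc = static function (string $text): string {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    };

    $url = $esc($entry['external_url']);
    // Fall back to the URL itself when the target page exposes no title.
    $title = $esc($entry['external_title'] ?? $entry['external_url']);

    $formatted = [
        'preview_title' => '<a href="' . $url . '">' . $title . '</a>',
    ];
    if (!empty($entry['external_description'])) {
        $formatted['preview_description'] = '<p>' . $esc($entry['external_description']) . '</p>';
    }
    if (!empty($entry['external_imageUrl'])) {
        $formatted['preview_imageUrl'] = '<a href="' . $url . '">'
            . '<img src="' . $esc($entry['external_imageUrl']) . '" alt="Go to site"/></a>';
    }
    return $formatted;
}
```

With this shape, a link that only exposes a title (such as the readthedocs example in the acceptance tests) yields only the preview_title key, so the view can skip the missing parts.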

The unit tests for CachedDatasetLinksPreview and FormattedDatasetLinksPreview are passing.


Work will continue after #546 has been merged.