nf-core / website

Code and files for the main nf-core website.
https://nf-co.re
MIT License

Show #awsmegatests results on pipeline page #435

Closed ewels closed 3 years ago

ewels commented 4 years ago

See https://github.com/nf-core/awsmegatests/issues/15

We will soon start having the results from the full-size test datasets stored on s3. It would be great to make these visible through the nf-core website as easily as possible.

Need a few things here:

Note that we could potentially dodge the last two by just cheating and using an iframe to the s3 browser? Not sure how this would work.

ggabernet commented 3 years ago

Hi, the bucket is public now, and the standardized file structure for the files is:

Under this is the results folder structure of each of the pipelines. For example, for the RNAseq pipeline, the MultiQC report is here for one of the tests:

https://nf-core-awsmegatests.s3-eu-west-1.amazonaws.com/rnaseq/results-022dc193f476130e83da8c2bbee088467ad20e16/MultiQC/multiqc_report.html
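The pattern in that URL could be captured in a tiny helper so the website can link straight to a run's report. A sketch in Python (function names are illustrative; the `MultiQC/multiqc_report.html` path is taken from the example above and only assumed to be stable across pipelines):

```python
S3_BASE = "https://nf-core-awsmegatests.s3-eu-west-1.amazonaws.com"

def results_prefix(pipeline: str, commit_hash: str) -> str:
    """Key prefix for one run's results: '<pipeline>/results-<hash>'."""
    return f"{pipeline}/results-{commit_hash}"

def multiqc_report_url(pipeline: str, commit_hash: str) -> str:
    """URL of a run's MultiQC report, following the example layout above."""
    return f"{S3_BASE}/{results_prefix(pipeline, commit_hash)}/MultiQC/multiqc_report.html"
```

Plugging in `rnaseq` and the commit hash from the example reproduces the link above.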

I can see that figuring out the commit hash is not trivial, so for the full size datasets that are triggered only on releases, I could adapt the workflow so that the file structure is:

ewels commented 3 years ago

No, the commit hash is easy - we already fetch that from the GitHub API and save it in https://nf-co.re/pipelines.json so should be fairly trivial to link hash to release.
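A sketch of what that release-to-hash lookup could look like against pipelines.json (the `remote_workflows` / `releases` / `tag_name` / `tag_sha` field names are my assumption about the schema; the live fetch is left commented out):

```python
import json
from urllib.request import urlopen

def release_commit_hashes(pipelines: dict, name: str) -> dict:
    """Map release tag -> commit hash for one pipeline.

    Assumed pipelines.json shape:
    {"remote_workflows": [{"name": ..., "releases": [{"tag_name": ..., "tag_sha": ...}]}]}
    """
    for wf in pipelines.get("remote_workflows", []):
        if wf.get("name") == name:
            return {rel["tag_name"]: rel["tag_sha"] for rel in wf.get("releases", [])}
    return {}

# Live usage (needs network):
# pipelines = json.load(urlopen("https://nf-co.re/pipelines.json"))
# print(release_commit_hashes(pipelines, "rnaseq"))
```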

What's harder is getting a listing of the files. The bucket doesn't allow simply browsing a directory; you need to know the full file path to see a file. So we need to build some kind of in-page file browser.
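Since S3 only hands back flat key listings, one building block for such an in-page browser is folding the keys into a nested tree. A minimal sketch (pure Python, names illustrative):

```python
def keys_to_tree(keys, prefix=""):
    """Fold flat S3 object keys into a nested dict; files map to None."""
    tree = {}
    for key in keys:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].strip("/").split("/")
        node = tree
        for part in parts[:-1]:      # descend into / create directory nodes
            node = node.setdefault(part, {})
        node[parts[-1]] = None       # leaf entry = file
    return tree
```

The front end could then render dict values as expandable directories and `None` leaves as links back into the bucket.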

Phil

ggabernet commented 3 years ago

Ah ok, that's good to know, thanks!

I see that knowing the results folder structure might be a problem. We could start by displaying the MultiQC reports, which all pipelines should have in the same folder. For the rest of the results, would it help if we had some kind of results-structure JSON?

ewels commented 3 years ago

Not necessarily - for example some proteomics pipelines don't have MultiQC reports at all.

I think that we need to actually query the s3 bucket and show a file browser. There are loads of AWS SDKs around though - for example, this one for JavaScript lists objects.

Better still would be to use an existing file-browser JavaScript library. It would need some customisation to work with different bucket directories, potentially load files into an iFrame, and other stuff, but it's just a case of finding the right library. A quick google suggests that there are plenty kicking around (e.g. here, here, here, jQuery plugin here). Just a question of playing around with them and finding the best one for our use case.

Phil

ewels commented 3 years ago

ps. It would be suuuper cool if we get this implemented in the next couple of weeks or so, before my AWS webinar 👀 😅

@mashehu do you have time / fancy having a play around with this? I'm thinking that it maybe deserves a new tab - Example Results? It could of course be a new auto-generated page section under Outputs, but I'm a bit worried that it's quite a large feature to add to an already large page.

mashehu commented 3 years ago

I can have a go at it next week. But I'm not (yet) familiar with AWS, so I will probably ask a lot of questions in the Slack channel 🙂

I agree with the extra tab, maybe with a link to the output tab and vice versa.

ggabernet commented 3 years ago

Hi @mashehu @ewels,

Regarding listing the files, I've played with the boto3 Python library, which can list all files in the bucket, optionally filtering by prefix.

import boto3

# Credentials are picked up from ~/.aws or environment variables
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('nf-core-awsmegatests')

# List every object under a single pipeline run by key prefix
for obj in my_bucket.objects.filter(Prefix='viralrecon/results-f8b874fbef10fa0e76cf19002e61c81bf26678de'):
    print(obj)

I've tested it and this lists all the results files for that run of viralrecon. E.g.:

s3.ObjectSummary(bucket_name='nf-core-awsmegatests', key='viralrecon/results-f8b874fbef10fa0e76cf19002e61c81bf26678de/variants/varscan2/snpeff/sample2.snpEff.summary.html')
s3.ObjectSummary(bucket_name='nf-core-awsmegatests', key='viralrecon/results-f8b874fbef10fa0e76cf19002e61c81bf26678de/variants/varscan2/snpeff/sample2.snpEff.vcf.gz')
s3.ObjectSummary(bucket_name='nf-core-awsmegatests', key='viralrecon/results-f8b874fbef10fa0e76cf19002e61c81bf26678de/variants/varscan2/snpeff/sample2.snpEff.vcf.gz.tbi')
s3.ObjectSummary(bucket_name='nf-core-awsmegatests', key='viralrecon/results-f8b874fbef10fa0e76cf19002e61c81bf26678de/variants/varscan2/snpeff/sample2.snpSift.table.txt')

It is then also possible to download these files with boto3. For authentication, it expects the AWS credentials for that account to be in the home directory or passed as environment variables: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#guide-configuration

If it's a public bucket, maybe that's not needed though; we'll have to experiment here.
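For experimenting without credentials: with boto3 this would mean an unsigned client (`botocore.UNSIGNED` as the signature config), but a public bucket also answers the plain ListObjectsV2 REST endpoint (`?list-type=2`) over HTTP, so a stdlib-only sketch is possible too. Endpoint taken from the URLs above; responses are capped at 1000 keys per page and continuation-token pagination is omitted here:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET_URL = "https://nf-core-awsmegatests.s3-eu-west-1.amazonaws.com"
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # namespace used in S3 XML responses

def parse_listing(xml_text: str) -> list:
    """Pull the object keys out of a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [c.findtext(f"{S3_NS}Key") for c in root.iter(f"{S3_NS}Contents")]

def list_public_bucket(prefix: str) -> list:
    """List keys under a prefix via an unauthenticated GET (needs network)."""
    url = f"{BUCKET_URL}/?list-type=2&prefix={urllib.parse.quote(prefix)}"
    with urllib.request.urlopen(url) as resp:
        return parse_listing(resp.read().decode())
```

The same XML parsing would work client-side in the browser, which is where the JavaScript SDK or a file-browser library mentioned above comes in.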

ggabernet commented 3 years ago

However, I wouldn't know how to integrate this into the nf-core website to display the files, so that's where your magic comes into play 😄