nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
885 stars 702 forks source link

Remove copyTo call for iGenomes README #896

Closed drpatelh closed 1 year ago

drpatelh commented 1 year ago

Description of the bug

We need to remove / refactor this call because it raises an error when accessing buckets across regions with AWS https://github.com/nf-core/rnaseq/blob/e049f51f0214b2aef7624b9dd496a404a7c34d14/workflows/rnaseq.nf#L59-L64

image

Maybe we document this instead and provide a command that can be used to fetch the README if required via the aws cli

$ aws s3 cp --no-sign-request s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt .
download: s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt to ./README.txt

~ cat README.txt
The contents of the annotation directories were downloaded from Ensembl on: July 17, 2015.

Gene annotation files were downloaded from Ensembl release 75. SmallRNA annotation files were downloaded from miRBase release 21.

Command used and terminal output

No response

Relevant files

No response

System information

No response

robsyme commented 1 year ago

I mid-term solution might be to improve copyTo to detect bucket regions and in cases where regions don't match, download and then re-upload files.

The copyTo method calls down to FileHelper::copyPath and then (on S3, at least) to S3FileSystemProvider::copy.

To the copy method linked above, I was hoping that we could add to something like

AmazonS3Client client = s3Source.getFileSystem().getClient();

String sourceBucketLocation = client.getBucketLocation(s3Source.getBucket());
String targetBucketLocation = client.getBucketLocation(s3Target.getBucket());

if(sourceBucketLocation != targetBucketLocation) {
  Path staged = <some temporary destination in work/stage>
  download(source, staged)
  upload(staged, target)
} else {
  <the other tests already in the copy method>
}

My only hesitation is that the AWS documentation about the GetBucketLocation method suggests that to use the API, "you must be the bucket owner". I'm unsure why knowing the bucket location should be a secrete available only to the bucket owner though.

We can verify this restriction with some testing of iGenomes data though.

drpatelh commented 1 year ago

Not sure it is worth all of this hassle to get a single file from an AWS bucket? Just means we can keep things simple and less prone to breakage in the pipeline trying to accommodate downloading the file. Docs is definitely the way ahead and we can extend on the commands I posted in the first comment and add directly to the website in the section below: https://github.com/nf-core/nf-co.re/blob/master/markdown/usage/reference_genomes.md#illumina-igenomes

This would allow us to have a central location that we can point to without needing to add to every pipeline.

drpatelh commented 1 year ago

Fixed in https://github.com/nf-core/rnaseq/pull/908

drpatelh commented 1 year ago

Docs added to main website here and available below: https://nf-co.re/docs/usage/reference_genomes