phenopolis / phenopolis_genomics_browser

Python API and React frontend for the Phenopolis Genomics Browser
https://dev-live.phenopolis.org
MIT License

Server side: Uploading and storing of files #299

Open IsmailM opened 3 years ago

IsmailM commented 3 years ago

Previous Plan

The initial plan was to use Client => AWS S3 uploads - i.e. instead of uploading files to the webserver and then to S3, upload them directly to S3.

In this mode, the client would still need to contact the server to get a presigned AWS S3 upload URL...

Current Plan

As the files are small (i.e. < 100 MB), we can use a classical file upload:

Client => Web Server => AWS S3

Here is an example: https://github.com/transloadit/uppy/tree/master/examples/python-xhr

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload_file():
    # check if the POST request has a file part
    if len(request.files) == 0:
        return jsonify(error="No file in request"), 400
    for fi in request.files:
        file = request.files[fi]
        # allowed_file() is the extension whitelist check from the uppy example
        if file and allowed_file(file.filename):
            # Upload file to S3 here
            return jsonify(message="ok"), 201
    return jsonify(error="File type not allowed"), 400

See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html on how to upload files to S3 from the webserver.
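For example, a minimal sketch using boto3's upload_fileobj (the helper name is illustrative; the bucket is the one created below):

import boto3

s3 = boto3.client("s3")

def upload_to_s3(file_storage, key, bucket="phenopolis-website-uploads"):
    # stream the werkzeug FileStorage straight to S3 without
    # writing it to the webserver's disk first
    s3.upload_fileobj(file_storage, bucket, key)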

A few points - look into secure_filename (from werkzeug, which Flask is built on).

Or maybe ignore the user-provided filename entirely and generate a new filename based on the Phenopolis patient ID and file type...
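A sketch of that renaming (the function and the file_type argument are illustrative assumptions):

from werkzeug.utils import secure_filename

def build_filename(patient_id, file_type, original_filename):
    # keep only a sanitised extension from the user-supplied name, e.g.
    # build_filename("PH00001", "bam", "../evil name.BAM") -> "PH00001-bam.BAM"
    ext = secure_filename(original_filename).rsplit(".", 1)[-1]
    return f"{patient_id}-{file_type}.{ext}"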

IsmailM commented 3 years ago

I have created a bucket on S3: s3://phenopolis-website-uploads

You should be able to access it using the AWS access key + secret access key that you all have.

You should be able to see this in the command line with (I am using --profile phenopolis as I have multiple AWS accounts):

aws s3 --profile phenopolis ls s3://phenopolis-website-uploads

S3 Folder Structure

Would be good to have your input on this @logust79 @pontikos

I think this is important, as in the future we may upload other types of files (e.g. eye scans).

I think there are two options:

So in the future we could have:

Filenames

Personally, I think the best solution is not to encode id + DateTime into the filename, but rather use a UUID:

And then add a table in the database where we store original filename + the UUID + which individual the file refers to.
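For example (a sketch; the table name, columns, and db cursor are placeholders, not the actual schema):

import uuid

def register_upload(cursor, original_filename, individual_id):
    # name the S3 object by a fresh UUID and keep the mapping
    # original filename -> UUID -> individual in the database
    file_uuid = str(uuid.uuid4())
    cursor.execute(
        "INSERT INTO uploaded_files (uuid, original_filename, individual_id)"
        " VALUES (%s, %s, %s)",
        (file_uuid, original_filename, individual_id),
    )
    return file_uuid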

logust79 commented 3 years ago

Currently I arrange the structure like PH0000001/bqsr.bam, PH0000001/vep.json, etc. Is there a point at which subfolders make more sense, like PH0000001/bam/*.bam and PH0000001/vcf/*.vcf.*?

IsmailM commented 3 years ago

I think we should future-proof this as much as possible - it will be a lot of effort to change this later.

pontikos commented 3 years ago

Subfolders will make sense if we need dates, like for OCT scans.

logust79 commented 3 years ago

Currently I work with four folders (maybe more later), i.e. rawreads (fastq), bam, variantCall and annotation, plus some other analyses or post-annotation processing such as exomiser and vep2tsv.

logust79 commented 3 years ago

I'll adapt the nextflow code once we agree on folder structure.

pontikos commented 3 years ago

ok @logust79 please come up with a suggestion that makes sense for you.

I think the key difference for you is that you want the filenames to be the same across all patients but that the folder names can change?

logust79 commented 3 years ago

That's right. As long as the sub-path for a file is the same across all individuals, I'm happy. So for now I would suggest all fastq files go to fastq, all bam files go to bam, all annotated files go to annotation, and for each analysis the result goes to a folder named after the analysis (such as exomiser). What do you think?
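A sketch of the S3 key this convention implies (the function name is just illustrative):

def s3_key(individual_id, file_type, filename):
    # e.g. s3_key("PH00001", "bam", "aligned.bam") -> "PH00001/bam/aligned.bam"
    return f"{individual_id}/{file_type}/{filename}"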


pontikos commented 3 years ago

Ok, so all files will have the same name but be in different PHxxx folders?

E.g.:

PH00001/fastq/reads1.fq.gz
PH00001/bam/aligned.bam
PH00002/fastq/reads1.fq.gz
PH00002/bam/aligned.bam

logust79 commented 3 years ago

Yes! Though fastq files are not required to have the same file name (the pipeline reads the fastq file names from a table), it would be good to have the same name, as the input would be simplified too.


IsmailM commented 3 years ago

Is it possible that we would run the analysis twice? Or have multiple WES runs...

So maybe something like:

PH00001/exome-1/fastq/reads1.fq.gz
PH00001/exome-1/bam/aligned.bam

And then if we rerun stuff (or if we have new data etc. for the same individual):

PH00001/exome-2/bam/aligned.bam

What do you think?

logust79 commented 3 years ago

I like the idea, but it might introduce complexity to the pipeline automation.

I guess the fastq files remain constant across runs, so they might not need exome-1 in the path. As for the rest, we could do something like PH000001/bam/run-1/bqsr.bam: when the pipeline runs, if the target file already exists, it automatically increments the run count.
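A sketch of that auto-increment against S3 (assuming the phenopolis-website-uploads bucket from above; it ignores pagination, which is fine for a handful of runs):

import boto3

s3 = boto3.client("s3")

def next_run_prefix(individual_id, file_type, bucket="phenopolis-website-uploads"):
    # look for existing run-N folders under e.g. PH000001/bam/
    # and return the prefix for run N+1 (run-1 if none exist yet)
    prefix = f"{individual_id}/{file_type}/run-"
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    runs = {
        int(obj["Key"][len(prefix):].split("/", 1)[0])
        for obj in resp.get("Contents", [])
    }
    n = max(runs) + 1 if runs else 1
    return f"{individual_id}/{file_type}/run-{n}/"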

We might have some issues, though, if a customer provides bam files.


IsmailM commented 3 years ago

So let's stick with that :)

So we will have files as follows:

# PH00001
PH00001/fastq/reads1.fq.gz
PH00001/fastq/reads2.fq.gz
PH00001/bam/run-1/aligned.bam

# PH00002
PH00002/fastq/reads1.fq.gz
PH00002/fastq/reads2.fq.gz
PH00002/bam/run-1/aligned.bam

If, in the future, we need to support multiple input FASTQs (e.g. WGS + exome), we can add that functionality at that time...