qbicsoftware / data-download-server

MIT License

Download only specific files from a measurement #19

Open KochTobi opened 4 months ago

KochTobi commented 4 months ago

Is your feature request related to a problem? Please describe.

Projects with many samples can produce a large amount of data in a single measurement. When an effect is only observed in a subset of the generated data, analysing it currently requires downloading the whole measurement, including many files that are of no interest for the analysis. Typical problems are data corruption or data quality issues, for example in multiplexed runs where only a subset of the output is of interest.

Describe the solution you'd like

As the files of interest are known beforehand, a filter on part of the file name would make it possible to download only those files.

Describe alternatives you've considered

@qbicStefanC any ideas?

Additional context

---
title: Example
---
flowchart LR
   s1("sample 1") --> sequencer 
   s2("sample 2") --> sequencer 
   s3("sample 3") --> sequencer 
   sequencer --> multi_out("multiplexed output (BCL)")
   multi_out --> demultiplex("de-multiplexing")
   demultiplex --> o11("2020-01-01_sample_1_L001.fastq.gz")
   demultiplex --> o12("2020-01-01_sample_1_L002.fastq.gz")
   demultiplex -- "error" --> o13("2020-01-01_sample_1_L003.fastq.gz")
   demultiplex --> o21("2020-01-01_sample_2_L001.fastq.gz")
   demultiplex --> o22("2020-01-01_sample_2_L002.fastq.gz")
   demultiplex --error--> o23("2020-01-01_sample_2_L003.fastq.gz")
   demultiplex --> o31("2020-01-01_sample_3_L001.fastq.gz")
   demultiplex --> o32("2020-01-01_sample_3_L002.fastq.gz")
   demultiplex --error--> o33("2020-01-01_sample_3_L003.fastq.gz")

qbicStefanC commented 4 months ago

Most of it looks good. I am not 100% sure what is meant by the 'multiplexed runs' part, though. As far as I know, a multiplexed BCL file from, say, a MiSeq is usually turned into demultiplexed FASTQ files (one to many files per sample barcode). Demultiplexing is therefore the step from BCL to FASTQ.

qbicStefanC commented 4 months ago

The example would look more like this: [Screenshot 2024-07-09 at 11 00 55]

Although this is also simplified, the key point is that lane 003 is affected across all samples in this example. Let's assume all files of lane 003 might be corrupted and should be investigated. In that case, a file name search with a regex for "L003" would help.
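
To make the lane example concrete, here is a minimal Python sketch of such a file name search. It is not part of the data-download-server code base. The file names are the ones from the diagram in this issue, and the helper name `filter_by_regex` is made up for illustration.

```python
import re

# File names taken from the example diagram in this issue.
FILES = [
    "2020-01-01_sample_1_L001.fastq.gz",
    "2020-01-01_sample_1_L002.fastq.gz",
    "2020-01-01_sample_1_L003.fastq.gz",
    "2020-01-01_sample_2_L001.fastq.gz",
    "2020-01-01_sample_2_L002.fastq.gz",
    "2020-01-01_sample_2_L003.fastq.gz",
    "2020-01-01_sample_3_L001.fastq.gz",
    "2020-01-01_sample_3_L002.fastq.gz",
    "2020-01-01_sample_3_L003.fastq.gz",
]


def filter_by_regex(names, pattern):
    """Return the names in which the pattern matches anywhere (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


# Select only the lane 003 files across all samples.
lane3_files = filter_by_regex(FILES, r"L003")
for name in lane3_files:
    print(name)
```

Using `search` instead of `fullmatch` means the user only has to supply the distinguishing part of the name, which matches the "filter for a part of the filename" idea from the original request.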

sven1103 commented 3 months ago

To me it looks like the download API could use query parameters for this, e.g. filterType and filter.

I suggest:

filterType:

filter:
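
As a rough illustration of how the two query parameters could be interpreted, here is a minimal Python sketch. It is only a sketch under assumptions: the thread does not state which values filterType accepts, so the values "regex" and "contains" below are hypothetical, as are the URL and the function name `build_filter`. It does not reflect the actual server implementation.

```python
import re
from urllib.parse import parse_qs, urlsplit


def build_filter(url):
    """Build a file name predicate from the filterType and filter query parameters.

    Hypothetical helper: the accepted filterType values ("regex", "contains")
    are assumptions made for this sketch, not values defined in this issue.
    """
    query = parse_qs(urlsplit(url).query)
    filter_type = query.get("filterType", [None])[0]
    value = query.get("filter", [None])[0]

    # Without both parameters, fall back to the current behaviour: accept every file.
    if filter_type is None or value is None:
        return lambda name: True

    if filter_type == "regex":
        pattern = re.compile(value)
        return lambda name: pattern.search(name) is not None

    if filter_type == "contains":
        return lambda name: value in name

    raise ValueError(f"unsupported filterType: {filter_type!r}")


# Hypothetical request URL; the path is made up for this example.
accept = build_filter("https://example.org/download/measurement-1?filterType=regex&filter=L003")
print(accept("2020-01-01_sample_1_L003.fastq.gz"))
print(accept("2020-01-01_sample_1_L001.fastq.gz"))
```

Keeping the filter type separate from the filter value leaves room to add other matching strategies later without changing the meaning of existing requests.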