Open kapsakcj opened 1 year ago
So good news and bad news...
Good news first:
fastq-dl --accession SRR25316086 --verbose --outdir sra-normalized --provider sra --only-provider
...
2023-08-02 22:40:31 DEBUG 2023-08-02 22:40:31:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'prefetch SRR25316086 --max-size 10T -o SRR25316086.sra'). __init__.py:1638
DEBUG 2023-08-02 22:40:31:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-02 22:40:31:root:DEBUG - 2023-08-02T22:40:25 prefetch.3.0.6: Current preference is set to retrieve SRA Normalized Format files with full base quality scores. fastq_dl.py:93
2023-08-02T22:40:25 prefetch.3.0.6: 1) Downloading 'SRR25316086'...
2023-08-02T22:40:25 prefetch.3.0.6: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-08-02T22:40:25 prefetch.3.0.6: Downloading via HTTPS...
2023-08-02T22:40:30 prefetch.3.0.6: HTTPS download succeed
2023-08-02T22:40:31 prefetch.3.0.6: 'SRR25316086' is valid
2023-08-02T22:40:31 prefetch.3.0.6: 1) 'SRR25316086' was downloaded successfully
2023-08-02T22:40:31 prefetch.3.0.6: 'SRR25316086' has 0 unresolved dependencies
...
zcat sra-normalized/SRR25316086_1.fastq.gz | fastq-scan -q
{
"qc_stats": {
"total_bp": 150934381,
"coverage": 0.00,
"read_total": 1010912,
"read_min": 35,
"read_mean": 149.305,
"read_std": 9.49214,
"read_median": 151,
"read_max": 151,
"read_25th": 150,
"read_75th": 151,
"qual_min": 2,
"qual_mean": 36.9532,
"qual_std": 2.18818,
"qual_max": 38,
"qual_median": 38,
"qual_25th": 37,
"qual_75th": 38
}
}
Q-scores range from 2 to 38, so this should get you what you need. However, I can add a way to let the user switch between SRA Normalized and SRA Lite, with Normalized as the default.
Now the bad news:
fastq-dl --accession SRR25316086 --verbose --outdir ena
...
zcat ena/SRR25316086_1.fastq.gz | fastq-scan -q
{
"qc_stats": {
"total_bp": 150934381,
"coverage": 0.00,
"read_total": 1010912,
"read_min": 35,
"read_mean": 149.305,
"read_std": 9.49214,
"read_median": 151,
"read_max": 151,
"read_25th": 150,
"read_75th": 151,
"qual_min": 30,
"qual_mean": 30,
"qual_std": 0,
"qual_max": 30,
"qual_median": 30,
"qual_25th": 30,
"qual_75th": 30
}
}
# Force SRA Lite
vdb-config --simplified-quality-scores yes
fastq-dl --accession SRR25316086 --verbose --outdir sra-lite --provider sra --only-provider
...
DEBUG 2023-08-02 22:44:13:root:DEBUG - 2023-08-02T22:44:11 prefetch.3.0.6: Current preference is set to retrieve SRA Lite files with simplified base quality scores. fastq_dl.py:93
2023-08-02T22:44:11 prefetch.3.0.6: 1) Downloading 'SRR25316086.lite'...
2023-08-02T22:44:11 prefetch.3.0.6: SRA Lite file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-08-02T22:44:11 prefetch.3.0.6: Downloading via HTTPS...
2023-08-02T22:44:12 prefetch.3.0.6: HTTPS download succeed
2023-08-02T22:44:13 prefetch.3.0.6: 'SRR25316086.lite' is valid
2023-08-02T22:44:13 prefetch.3.0.6: 1) 'SRR25316086.lite' was downloaded successfully
2023-08-02T22:44:13 prefetch.3.0.6: 'SRR25316086' has 0 unresolved dependencies
...
zcat sra-lite/SRR25316086_1.fastq.gz | fastq-scan -q
{
"qc_stats": {
"total_bp": 150934381,
"coverage": 0.00,
"read_total": 1010912,
"read_min": 35,
"read_mean": 149.305,
"read_std": 9.49214,
"read_median": 151,
"read_max": 151,
"read_25th": 150,
"read_75th": 151,
"qual_min": 30,
"qual_mean": 30,
"qual_std": 0,
"qual_max": 30,
"qual_median": 30,
"qual_25th": 30,
"qual_75th": 30
}
}
It looks like ENA synced the SRA Lite version of the reads, and not the Normalized. This was also the case for SRR13086318.
Hmmm, this bugs me, because I usually use ENA as the default provider since they provide FASTQs directly. But I also want the original quality scores, which reads synced from SRA may or may not provide. The blog post above is dated October 2021, so I'm unsure whether reads synced from SRA to ENA after that date carry the SRA Lite Q-scores.
I'm wondering if a solution might be to add a third provider: source and based on the accession download from the original provider (e.g. SRR from SRA, ERR from ENA, DRR either SRA or ENA).
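A rough sketch of how the proposed "source" provider could pick an archive from the accession prefix. The mapping table and function name are hypothetical, not fastq-dl code, and trying ENA first for DRR runs is an arbitrary assumption, since the comment only says "either".

```python
import re
from typing import List

# Hypothetical mapping, following the comment above: SRR -> SRA, ERR -> ENA,
# DRR -> either archive (ENA is tried first here purely as an assumption).
SOURCE_PROVIDERS = {
    "SRR": ["sra"],
    "ERR": ["ena"],
    "DRR": ["ena", "sra"],
}


def source_providers(accession: str) -> List[str]:
    """Return the provider(s) to try, in order, for a run accession."""
    match = re.fullmatch(r"([SED]RR)\d+", accession.strip().upper())
    if match is None:
        # Study/experiment accessions (e.g. SRP...) are out of scope for this sketch.
        raise ValueError(f"not a run accession: {accession!r}")
    return list(SOURCE_PROVIDERS[match.group(1)])


print(source_providers("SRR25316086"))
```

Anything that is not a run accession raises ValueError here; a real implementation would probably resolve study or experiment accessions to runs first.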
I'm wondering if a solution might be to add a third provider: source and based on the accession download from the original provider (e.g. SRR from SRA, ERR from ENA, DRR either SRA or ENA).
This sounds like the best first pass solution to me
Thanks for the quick reply & brainstorming on solutions.
Just wanted to share this example where, despite using the options Robert suggested, it still seemed to download SRA Lite formatted FASTQs, even though the output explicitly states "SRA Normalized Format file is being retrieved". Maybe I just got unlucky with this particular accession?
# fastq-dl v2.0.2 installed via mamba
$ fastq-dl -a SRR13086318 --verbose --provider sra --only-provider
2023-08-03 10:09:37 DEBUG 2023-08-03 10:09:37:root:DEBUG - Querying ENA for metadata... fastq_dl.py:428
DEBUG 2023-08-03 10:09:37:root:DEBUG - --only-provider supplied, limiting queries to sra fastq_dl.py:431
DEBUG 2023-08-03 10:09:37:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:39 DEBUG 2023-08-03 10:09:39:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "POST /entrez/eutils/esearch.fcgi HTTP/1.1" 200 None connectionpool.py:456
DEBUG 2023-08-03 10:09:39:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:40 DEBUG 2023-08-03 10:09:40:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "GET /entrez/eutils/esummary.fcgi?db=sra&usehistory=n&retmode=json&query_key=1&WebEnv=MCID_64cbb522a06d0e3d496a66e5&retstart=0&retmax=500 HTTP/1.1" 200 None connectionpool.py:456
DEBUG 2023-08-03 10:09:40:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:41 DEBUG 2023-08-03 10:09:41:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "GET /entrez/eutils/esearch.fcgi?db=sra&usehistory=n&retmode=json&term=SRR13086318 HTTP/1.1" 200 None connectionpool.py:456
DEBUG 2023-08-03 10:09:41:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443 connectionpool.py:1003
2023-08-03 10:09:42 DEBUG 2023-08-03 10:09:42:urllib3.connectionpool:DEBUG - https://eutils.ncbi.nlm.nih.gov:443 "GET /entrez/eutils/efetch.fcgi?db=sra&usehistory=n&retmode=runinfo&query_key=1&WebEnv=MCID_64cbb5240bbbf858ca74f635&retstart=0&retmax=500 HTTP/1.1" 200 None connectionpool.py:456
DEBUG 2023-08-03 10:09:42:urllib3.connectionpool:DEBUG - Starting new HTTPS connection (1): www.ebi.ac.uk:443 connectionpool.py:1003
2023-08-03 10:10:00 DEBUG 2023-08-03 10:10:00:urllib3.connectionpool:DEBUG - https://www.ebi.ac.uk:443 "GET /ena/portal/api/filereport?result=read_run&fields=fastq_ftp&accession=SRP074197 HTTP/1.1" 200 None connectionpool.py:456
2023-08-03 10:10:10 INFO 2023-08-03 10:10:10:root:INFO - Query: SRR13086318 fastq_dl.py:629
INFO 2023-08-03 10:10:10:root:INFO - Archive: sra fastq_dl.py:630
INFO 2023-08-03 10:10:10:root:INFO - Total Runs To Download: 1 fastq_dl.py:635
INFO 2023-08-03 10:10:10:root:INFO - Working on run SRR13086318... fastq_dl.py:654
DEBUG 2023-08-03 10:10:10:executor.process:DEBUG - Executing external command: bash -c 'prefetch SRR13086318 --max-size 10T -o SRR13086318.sra' __init__.py:1475
DEBUG 2023-08-03 10:10:10:executor.process:DEBUG - Constructing subprocess.Popen object .. __init__.py:1483
DEBUG 2023-08-03 10:10:10:executor.process:DEBUG - Joining synchronous process using subprocess.Popen.communicate() .. __init__.py:1504
2023-08-03 10:10:14 DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'prefetch SRR13086318 --max-size 10T -o SRR13086318.sra'). __init__.py:1638
DEBUG 2023-08-03 10:10:14:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-03 10:10:14:root:DEBUG - 2023-08-03T14:10:10 prefetch.3.0.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores. fastq_dl.py:93
2023-08-03T14:10:11 prefetch.3.0.3: 1) Downloading 'SRR13086318'...
2023-08-03T14:10:11 prefetch.3.0.3: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-08-03T14:10:11 prefetch.3.0.3: Downloading via HTTPS...
2023-08-03T14:10:14 prefetch.3.0.3: HTTPS download succeed
2023-08-03T14:10:14 prefetch.3.0.3: 'SRR13086318' is valid
2023-08-03T14:10:14 prefetch.3.0.3: 1) 'SRR13086318' was downloaded successfully
2023-08-03T14:10:14 prefetch.3.0.3: 'SRR13086318' has 0 unresolved dependencies
DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Executing external command: bash -c 'fasterq-dump SRR13086318 --split-3 --mem 1G --threads 1' __init__.py:1475
DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Constructing subprocess.Popen object .. __init__.py:1483
DEBUG 2023-08-03 10:10:14:executor.process:DEBUG - Joining synchronous process using subprocess.Popen.communicate() .. __init__.py:1504
2023-08-03 10:10:35 DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'fasterq-dump SRR13086318 --split-3 --mem 1G --threads 1'). __init__.py:1638
DEBUG 2023-08-03 10:10:35:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-03 10:10:35:root:DEBUG - spots read : 841,910 fastq_dl.py:93
reads read : 1,683,820
reads written : 1,683,820
DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Executing external command: bash -c 'pigz --force -p 1 -n SRR13086318*.fastq' __init__.py:1475
DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Constructing subprocess.Popen object .. __init__.py:1483
DEBUG 2023-08-03 10:10:35:executor.process:DEBUG - Joining synchronous process using subprocess.Popen.communicate() .. __init__.py:1504
2023-08-03 10:12:39 DEBUG 2023-08-03 10:12:39:executor.process:DEBUG - Got return code 0 from synchronous process (bash -c 'pigz --force -p 1 -n SRR13086318*.fastq'). __init__.py:1638
DEBUG 2023-08-03 10:12:39:root:DEBUG - fastq_dl.py:92
DEBUG 2023-08-03 10:12:39:root:DEBUG - fastq_dl.py:93
INFO 2023-08-03 10:12:39:root:INFO - Writing metadata to /home/curtis_kapsak/fastq-run-info.tsv
$ zcat SRR13086318_1.fastq.gz |fastq-scan
{
"qc_stats": {
"total_bp": 192698955,
"coverage": 0.00,
"read_total": 841910,
"read_min": 100,
"read_mean": 228.883,
"read_std": 38.6772,
"read_median": 250,
"read_max": 251,
"read_25th": 222,
"read_75th": 251,
"qual_min": 3,
"qual_mean": 29.9999,
"qual_std": 0.0416146,
"qual_max": 30,
"qual_median": 30,
"qual_25th": 30,
"qual_75th": 30
},
"read_lengths": {
"100": 1159, "101": 1237, "102": 1143, "103": 990, "104": 1329,
"105": 1231, "106": 1197, "107": 1044, "108": 1226, "109": 1266,
"110": 1199, "111": 1124, "112": 1253, "113": 1170, "114": 1132,
"115": 1124, "116": 1056, "117": 1028, "118": 1116, "119": 1113,
"120": 1139, "121": 1197, "122": 1168, "123": 1148, "124": 1285,
"125": 1283, "126": 1340, "127": 1340, "128": 1304, "129": 1405,
"130": 1337, "131": 1402, "132": 1283, "133": 1420, "134": 1419,
"135": 1292, "136": 1269, "137": 1470, "138": 1316, "139": 1420,
"140": 1266, "141": 1525, "142": 1364, "143": 1362, "144": 1303,
"145": 1594, "146": 1476, "147": 1649, "148": 1514, "149": 1557,
"150": 1462, "151": 1724, "152": 1417, "153": 1816, "154": 1795,
"155": 1804, "156": 1665, "157": 2004, "158": 2032, "159": 2004,
"160": 1640, "161": 2193, "162": 1625, "163": 1670, "164": 1613,
"165": 1594, "166": 1563, "167": 1631, "168": 1621, "169": 1623,
"170": 1573, "171": 1633, "172": 1654, "173": 1899, "174": 1673,
"175": 1796, "176": 1953, "177": 1847, "178": 1907, "179": 1890,
"180": 1864, "181": 2255, "182": 1879, "183": 1819, "184": 1945,
"185": 1838, "186": 1755, "187": 1846, "188": 1913, "189": 1957,
"190": 1962, "191": 1870, "192": 1929, "193": 2045, "194": 1976,
"195": 1955, "196": 2215, "197": 2395, "198": 2043, "199": 2259,
"200": 2263, "201": 2345, "202": 2163, "203": 2153, "204": 2315,
"205": 2301, "206": 2100, "207": 2245, "208": 2160, "209": 2326,
"210": 2242, "211": 2238, "212": 2724, "213": 2575, "214": 2492,
"215": 2469, "216": 2608, "217": 2425, "218": 2451, "219": 2524,
"220": 2715, "221": 2799, "222": 2629, "223": 2698, "224": 2714,
"225": 2648, "226": 2529, "227": 2688, "228": 2517, "229": 2445,
"230": 2549, "231": 2554, "232": 2485, "233": 2465, "234": 2700,
"235": 2721, "236": 2836, "237": 2681, "238": 3081, "239": 3061,
"240": 2965, "241": 2855, "242": 3209, "243": 2864, "244": 2955,
"245": 2928, "246": 3447, "247": 5498, "248": 18243, "249": 39943,
"250": 246256, "251": 254088
},
"per_base_quality": {
"1": 29.9999, "2": 29.9999, "3": 29.9999, "4": 29.9999, "5": 29.9999,
"6": 29.9999, "7": 29.9999, "8": 29.9999, "9": 29.9999, "10": 29.9999,
"11": 29.9999, "12": 29.9999, "13": 29.9999, "14": 29.9999, "15": 29.9999,
"16": 29.9999, "17": 29.9999, "18": 29.9999, "19": 29.9999, "20": 29.9999,
"21": 29.9999, "22": 29.9999, "23": 29.9999, "24": 29.9999, "25": 29.9999,
"26": 29.9999, "27": 29.9999, "28": 29.9999, "29": 29.9999, "30": 29.9999,
"31": 29.9999, "32": 29.9999, "33": 29.9999, "34": 29.9999, "35": 29.9999,
"36": 29.9999, "37": 29.9999, "38": 29.9999, "39": 29.9999, "40": 29.9999,
"41": 29.9999, "42": 29.9999, "43": 29.9999, "44": 29.9999, "45": 29.9999,
"46": 29.9999, "47": 29.9999, "48": 29.9999, "49": 29.9999, "50": 29.9999,
"51": 29.9999, "52": 29.9999, "53": 29.9999, "54": 29.9999, "55": 29.9999,
"56": 29.9999, "57": 29.9999, "58": 29.9999, "59": 29.9999, "60": 29.9999,
"61": 29.9999, "62": 29.9999, "63": 29.9999, "64": 29.9999, "65": 29.9999,
"66": 29.9999, "67": 29.9999, "68": 29.9999, "69": 29.9999, "70": 29.9999,
"71": 29.9999, "72": 29.9999, "73": 29.9999, "74": 29.9999, "75": 29.9999,
"76": 29.9999, "77": 29.9999, "78": 29.9999, "79": 29.9999, "80": 29.9999,
"81": 29.9999, "82": 29.9999, "83": 29.9999, "84": 29.9999, "85": 29.9999,
"86": 29.9999, "87": 29.9999, "88": 29.9999, "89": 29.9999, "90": 29.9999,
"91": 29.9999, "92": 29.9999, "93": 29.9999, "94": 29.9999, "95": 29.9999,
"96": 29.9999, "97": 29.9999, "98": 29.9999, "99": 29.9999, "100": 29.9999,
"101": 29.9999, "102": 29.9999, "103": 29.9999, "104": 29.9999, "105": 29.9999,
"106": 29.9999, "107": 29.9999, "108": 29.9999, "109": 29.9999, "110": 29.9999,
"111": 29.9999, "112": 29.9999, "113": 29.9999, "114": 29.9999, "115": 29.9999,
"116": 29.9999, "117": 29.9999, "118": 29.9999, "119": 29.9999, "120": 29.9999,
"121": 29.9999, "122": 29.9999, "123": 29.9999, "124": 29.9999, "125": 29.9999,
"126": 29.9999, "127": 29.9999, "128": 29.9999, "129": 29.9999, "130": 29.9999,
"131": 29.9999, "132": 29.9999, "133": 29.9999, "134": 29.9999, "135": 29.9999,
"136": 29.9999, "137": 29.9999, "138": 29.9999, "139": 29.9999, "140": 29.9999,
"141": 29.9999, "142": 29.9999, "143": 29.9999, "144": 29.9999, "145": 29.9999,
"146": 29.9999, "147": 29.9999, "148": 29.9999, "149": 29.9999, "150": 29.9999,
"151": 29.9999, "152": 29.9999, "153": 29.9999, "154": 29.9999, "155": 29.9999,
"156": 29.9999, "157": 29.9999, "158": 29.9999, "159": 29.9999, "160": 29.9999,
"161": 29.9999, "162": 29.9999, "163": 29.9999, "164": 29.9999, "165": 29.9999,
"166": 29.9999, "167": 29.9999, "168": 29.9999, "169": 29.9999, "170": 29.9999,
"171": 29.9999, "172": 29.9999, "173": 29.9999, "174": 29.9999, "175": 29.9999,
"176": 29.9999, "177": 29.9999, "178": 29.9999, "179": 29.9999, "180": 29.9999,
"181": 29.9999, "182": 29.9999, "183": 29.9999, "184": 29.9999, "185": 29.9999,
"186": 29.9999, "187": 29.9999, "188": 30, "189": 30, "190": 30,
"191": 30, "192": 30, "193": 30, "194": 30, "195": 30,
"196": 30, "197": 30, "198": 30, "199": 30, "200": 30,
"201": 30, "202": 30, "203": 30, "204": 30, "205": 30,
"206": 30, "207": 30, "208": 30, "209": 30, "210": 30,
"211": 30, "212": 30, "213": 30, "214": 30, "215": 30,
"216": 30, "217": 30, "218": 30, "219": 30, "220": 30,
"221": 30, "222": 30, "223": 30, "224": 30, "225": 30,
"226": 30, "227": 30, "228": 30, "229": 30, "230": 30,
"231": 30, "232": 30, "233": 30, "234": 30, "235": 30,
"236": 30, "237": 30, "238": 30, "239": 30, "240": 30,
"241": 30, "242": 30, "243": 30, "244": 30, "245": 30,
"246": 30, "247": 30, "248": 30, "249": 30, "250": 29.9999,
"251": 29.9999
}
}
OK, yup I think I just got unlucky with this particular accession: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR13086318&display=metadata
It seems to me that even the original FASTQs hosted on SRA are in SRA Lite format. I tried using fastq-dump and fasterq-dump v3.0.6 and still got SRA Lite formatted FASTQs.
It's looking like, given that fastq-dl uses sra-tools (specifically prefetch and fasterq-dump), the best I can do is to make sure we've done everything we're supposed to do to get the SRA Normalized FASTQs.
Unfortunately, after that, not much can be done about what SRA is serving up. For SRR13086318, it might be worth submitting a ticket to SRA and asking what's happening here.
Haha quite the can of worms that SRA Lite has opened!
agreed! Thank you for digging into this one. I will submit a ticket to the SRA helpdesk and see what they can tell me.
@kapsakcj as a band-aid, I released v2.0.3, which explicitly sets the preference to SRA Normalized by executing vdb-config --simplified-quality-scores no before each SRA download.
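For illustration only, a minimal sketch of that band-aid. It is not the actual v2.0.3 code, and the function names are made up. The only thing taken from this thread is the vdb-config --simplified-quality-scores yes/no call itself.

```python
import shutil
import subprocess
from typing import List


def vdb_config_command(sra_lite: bool = False) -> List[str]:
    """Build the vdb-config call selecting SRA Lite ("yes") or SRA Normalized ("no")."""
    return ["vdb-config", "--simplified-quality-scores", "yes" if sra_lite else "no"]


def set_quality_preference(sra_lite: bool = False) -> bool:
    """Apply the preference before a download; return False if vdb-config is not on PATH."""
    if shutil.which("vdb-config") is None:
        return False
    subprocess.run(vdb_config_command(sra_lite), check=True)
    return True


print(" ".join(vdb_config_command()))
```

As SRR13086318 shows, setting the preference only expresses what you want. prefetch can still hand back SRA Lite data if that is all SRA has for the run.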
I have to restructure things. When I do that, I'll add the --provider source option and likely a warning that the FASTQs might be SRA Lite derived if all the scores are Q30 (unless the --sra-lite option is used).
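One way such a warning could work, as a sketch under assumptions rather than fastq-dl's implementation. The function names, the Phred+33 offset, and the 0.999 threshold are all hypothetical. The threshold is below 1.0 because the SRR13086318 output above has a qual_min of 3 with a mean of 29.9999, so requiring every score to be exactly Q30 would miss that run.

```python
import gzip
import logging
from typing import Iterator


def iter_quality_scores(fastq_gz: str, phred_offset: int = 33) -> Iterator[int]:
    """Yield per-base Phred scores from a gzipped, four-lines-per-record FASTQ."""
    with gzip.open(fastq_gz, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 3:  # fourth line of each record holds the qualities
                for char in line.rstrip("\n"):
                    yield ord(char) - phred_offset


def looks_like_sra_lite(fastq_gz: str, min_fraction: float = 0.999) -> bool:
    """True when at least min_fraction of all base qualities are exactly Q30."""
    total = 0
    q30 = 0
    for score in iter_quality_scores(fastq_gz):
        total += 1
        q30 += score == 30
    return total > 0 and q30 / total >= min_fraction


def warn_if_sra_lite(fastq_gz: str, sra_lite_requested: bool = False) -> bool:
    """Log a warning unless the user explicitly asked for SRA Lite; return True if warned."""
    if sra_lite_requested:
        return False
    if looks_like_sra_lite(fastq_gz):
        logging.warning("%s: nearly all scores are Q30, possibly SRA Lite derived", fastq_gz)
        return True
    return False
```

Scanning every base is slow on large runs. Sampling the first N records, or reusing qual_std from a fastq-scan report, would be cheaper ways to get the same signal.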
This should at least allow you to move forward and know that we've provided SRA everything expected to get SRA Normalized format.
I've hit an odd issue where fastq-dl pulls FASTQs without issue, but they are in SRA Lite format instead of the typical SRA Normalized format.
FASTQs in SRA Lite format have ? as the Q-score for every base, which equates to Q30. This leads to issues where trimmomatic or other typical downstream software is unable to detect the Phred quality encoding, and the Q-scores are not useful during assembly (and probably other applications that use the Q-scores).
FASTQs in SRA Normalized format are the original format, which contains the full base quality scores.
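To make the "? equates to Q30" point concrete, here is a small sketch assuming the standard Phred+33 encoding, where the score is the character's ASCII code minus 33. The helper name is illustrative.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


# '?' is ASCII 63, so an SRA Lite quality line decodes to a flat run of 30s.
print(phred_scores("?????"))  # [30, 30, 30, 30, 30]
```

With only one distinct quality character in the file, a tool that infers the encoding from the range of observed characters has very little to go on.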
Some examples where I encountered this issue
I'm guessing it will be a big effort, but would it be possible for fastq-dl to download the SRA Normalized format of FASTQs?
Not sure how ENA deals with this issue, but sra-toolkit has an option for using this format.
More info: