kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

amalgkit sanity does not recognize safely_removed flag for single-end RNA-seq #80

Closed kfuku52 closed 2 years ago

kfuku52 commented 2 years ago

@Hego-CCTB Could you fix it? Also, sanity shouldn't exit every time it detects an anomaly. Otherwise, you are only able to recognize and fix the problems one at a time.
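
One way to avoid stopping at the first anomaly is to collect the problems per sample and report them all at the end. The following is only a minimal sketch of that pattern, not the actual amalgkit code: `collect_anomalies`, `demo_check`, and the sample names are hypothetical.

```python
from typing import Callable, Dict, List


def collect_anomalies(
    sample_ids: List[str],
    check: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run `check` on every sample and gather problems instead of exiting early."""
    anomalies: Dict[str, List[str]] = {}
    for sample_id in sample_ids:
        problems = check(sample_id)
        if problems:
            # Record the problem and keep going so later samples are still checked.
            anomalies[sample_id] = problems
    return anomalies


def demo_check(sample_id: str) -> List[str]:
    """Hypothetical check: pretend two made-up samples lack getfastq output."""
    missing = {"sample_B", "sample_D"}
    return ["getfastq output not found"] if sample_id in missing else []


report = collect_anomalies(["sample_A", "sample_B", "sample_C", "sample_D"], demo_check)
for sample_id, problems in report.items():
    print(sample_id, "->", "; ".join(problems))
```

With this shape, a single run lists every affected sample, and the caller can still decide on a non-zero exit status once the full report has been printed.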

https://github.com/kfuku52/amalgkit/blob/896f72526fe66add194aed30a28e04cc5781e512/amalgkit/util.py#L66-L67

amalgkit sanity --metadata ./amalgkit_out/metadata/metadata/metadata_all.tsv  --out_dir ./amalgkit_out --all
amalgkit sanity: start
reading metadata from: ./amalgkit_out/metadata/metadata/metadata_all.tsv
Checking essential entries from metadata file.
4  species detected:
['Arabidopsis thaliana' 'Cephalotus follicularis' 'Dionaea muscipula'
 'Nepenthes gracilis']
320  SRA runs detected:
['C1D' 'C1F' 'C1L' 'C1P' 'C1T' 'C1W' 'C2D' 'C2F' 'C2L' 'C2P' 'C2T' 'C2W'
 'C3D' 'C3F' 'C3L' 'C3P' 'C3T' 'C3W' 'Cf_1_1' 'Cf_1_2' 'Cf_1_3' 'Cf_1_4'
 'Cf_1_5' 'Cf_1_6' 'Cf_2_1' 'Cf_2_2' 'Cf_2_3' 'Cf_2_4' 'Cf_2_5' 'Cf_2_6'
 'Cf_3_1' 'Cf_3_2' 'Cf_3_3' 'Cf_3_4' 'Cf_3_5' 'Cf_3_6' 'DRR053690'
 'DRR053691' 'DRR053692' 'DRR053693' 'DRR053694' 'DRR053695' 'DRR053696'
 'DRR053697' 'DRR053698' 'DRR053699' 'DRR053700' 'DRR053701' 'DRR053702'
 'DRR053703' 'DRR053704' 'DRR053705' 'ERR4508085' 'ERR4508086'
 'ERR4508087' 'F1D' 'F1F' 'F1L' 'F1P' 'F1T' 'F1W' 'F2D' 'F2F' 'F2L' 'F2P'
 'F2T' 'F2W' 'F3D' 'F3F' 'F3L' 'F3P' 'F3T' 'F3W' 'R1' 'R3' 'R5'
 'SRR1659921' 'SRR1660397' 'SRR1661473' 'SRR1661475' 'SRR1661477'
 'SRR1661483' 'SRR1688325' 'SRR1688327' 'SRR1688425' 'SRR1688427'
 'SRR2073141' 'SRR2073142' 'SRR2073144' 'SRR2073174' 'SRR2073179'
 'SRR2795277' 'SRR2807621' 'SRR2807622' 'SRR2807623' 'SRR2807624'
 'SRR2807625' 'SRR2807626' 'SRR2807627' 'SRR2807628' 'SRR2807629'
 'SRR2807630' 'SRR2807631' 'SRR2807632' 'SRR2807633' 'SRR2807634'
 'SRR2807635' 'SRR2807636' 'SRR2807637' 'SRR2807638' 'SRR2807639'
 'SRR2807640' 'SRR2807641' 'SRR2807642' 'SRR2807643' 'SRR2807644'
 'SRR2807648' 'SRR2807649' 'SRR2807650' 'SRR2807651' 'SRR2807652'
 'SRR2807653' 'SRR2807654' 'SRR2807655' 'SRR2807656' 'SRR2807657'
 'SRR2807658' 'SRR2807659' 'SRR2807660' 'SRR2807661' 'SRR2807662'
 'SRR3581336' 'SRR3581345' 'SRR3581346' 'SRR3581347' 'SRR3581352'
 'SRR3581356' 'SRR3581383' 'SRR3581388' 'SRR3581499' 'SRR3581591'
 'SRR3581639' 'SRR3581672' 'SRR3581676' 'SRR3581678' 'SRR3581679'
 'SRR3581680' 'SRR3581681' 'SRR3581682' 'SRR3581683' 'SRR3581684'
 'SRR3581685' 'SRR3581686' 'SRR3581687' 'SRR3581688' 'SRR3581689'
 'SRR3581690' 'SRR3581691' 'SRR3581692' 'SRR3581693' 'SRR3581694'
 'SRR3581695' 'SRR3581696' 'SRR3581697' 'SRR3581698' 'SRR3581699'
 'SRR3581700' 'SRR3581701' 'SRR3581702' 'SRR3581703' 'SRR3581704'
 'SRR3581705' 'SRR3581706' 'SRR3581707' 'SRR3581708' 'SRR3581709'
 'SRR3581710' 'SRR3581711' 'SRR3581712' 'SRR3581713' 'SRR3581714'
 'SRR3581715' 'SRR3581716' 'SRR3581717' 'SRR3581719' 'SRR3581720'
 'SRR3581721' 'SRR3581724' 'SRR3581726' 'SRR3581727' 'SRR3581728'
 'SRR3581730' 'SRR3581731' 'SRR3581732' 'SRR3581733' 'SRR3581734'
 'SRR3581735' 'SRR3581736' 'SRR3581737' 'SRR3581738' 'SRR3581740'
 'SRR3581833' 'SRR3581834' 'SRR3581835' 'SRR3581836' 'SRR3581837'
 'SRR3581838' 'SRR3581839' 'SRR3581840' 'SRR3581841' 'SRR3581842'
 'SRR3581843' 'SRR3581844' 'SRR3581845' 'SRR3581846' 'SRR3581847'
 'SRR3581848' 'SRR3581849' 'SRR3581850' 'SRR3581851' 'SRR3581852'
 'SRR3581853' 'SRR3581854' 'SRR3581855' 'SRR3581856' 'SRR3581857'
 'SRR3581858' 'SRR3581859' 'SRR3581860' 'SRR3581861' 'SRR3581862'
 'SRR3581863' 'SRR3581864' 'SRR3581865' 'SRR3581866' 'SRR3581867'
 'SRR3581868' 'SRR3581869' 'SRR3581870' 'SRR3581871' 'SRR3581872'
 'SRR3581873' 'SRR3581874' 'SRR3581875' 'SRR3581876' 'SRR3581877'
 'SRR3581878' 'SRR3581879' 'SRR3581880' 'SRR3581881' 'SRR3581882'
 'SRR3581883' 'SRR3581884' 'SRR3581885' 'SRR3581886' 'SRR3581887'
 'SRR3581888' 'SRR3581889' 'SRR3581890' 'SRR3581891' 'SRR3581892'
 'SRR3581893' 'SRR3581894' 'SRR3581895' 'SRR3581896' 'SRR3581897'
 'SRR3581898' 'SRR3581899' 'SRR3724649' 'SRR3724650' 'SRR3724651'
 'SRR3724652' 'SRR3724663' 'SRR3724668' 'SRR3724737' 'SRR3724738'
 'SRR3724739' 'SRR3724741' 'SRR3724768' 'SRR3724774' 'SRR3724778'
 'SRR3724782' 'SRR3724785' 'SRR3724786' 'SRR3724787' 'SRR3724798'
 'SRR3724806' 'SRR3724814' 'SRR3725446' 'SRR3725458' 'SRR3725471'
 'SRR3725482' 'SRR3725493' 'SRR3725503' 'SRR3725516' 'SRR3725527'
 'SRR3725538' 'SRR3725550' 'SRR3725560' 'SRR3725561' 'SRR8834203'
 'SRR8834204' 'SRR8834205' 'SRR8834206' 'SRR8834207' 'SRR8834208'
 'SRR8834209' 'SRR8834210' 'SRR8834211' 'SRR8834212' 'SRR8834213'
 'SRR8834214' 'SRR8834215' 'SRR8834216' 'SRR8834217' 'SRR8834218'
 'SRR8834219' 'SRR8834220' 'SRR8834221' 'SRR8834222']
checking for getfastq outputs: 
amalgkit getfastq output folder detected. Checking presence of output files.

Looking for  SRR3581686
found:  ['./amalgkit_out/getfastq/SRR3581686/SRR3581686.amalgkit.fastq.gz']
checking for updated metadata in:  ./amalgkit_out/metadata/updated_metadata/metadata_SRR3581686.tsv
found updated metadata!

Looking for  SRR3581852
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./amalgkit_out/getfastq/SRR3581852
./amalgkit_out/getfastq/SRR3581852/SRR3581852.amalgkit.fastq.gz.safely_removed
getfastq output could not be found in: ./amalgkit_out/getfastq/SRR3581852, layout = single

Hego-CCTB commented 2 years ago

So, safely_removed means that the sample has a quant output, but the original fastq file has been deleted?

kfuku52 commented 2 years ago

Yes, although "original" doesn't apply to private fastq files.

Hego-CCTB commented 2 years ago

The desired behaviour, then, is that if the safely_removed flag is set, sanity shouldn't treat the sample as anomalous when it can't find getfastq output, correct?

So this is desired:

Looking for  SRR3581852
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./amalgkit_out/getfastq/SRR3581852
./amalgkit_out/getfastq/SRR3581852/SRR3581852.amalgkit.fastq.gz.safely_removed

But this isn't:

getfastq output could not be found in: ./amalgkit_out/getfastq/SRR3581852, layout = single
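
The behaviour described here could be sketched roughly as follows, assuming the file naming seen in the logs in this thread (`<ID>.amalgkit.fastq.gz` for single-end, `<ID>_1`/`<ID>_2` for paired-end, plus a `.safely_removed` suffix for the flag). `classify_getfastq_output` is a hypothetical helper, not a function in amalgkit.

```python
import glob
import os


def classify_getfastq_output(sra_dir: str, sra_id: str) -> str:
    """Return 'present', 'safely_removed', or 'missing' for one sample directory."""
    # glob.escape protects against special characters in the directory name.
    prefix = glob.escape(os.path.join(sra_dir, sra_id))
    # The wildcard covers both single-end (ID.*) and paired-end (ID_1.*, ID_2.*) names.
    fastq_files = glob.glob(prefix + "*.amalgkit.fastq.gz")
    flag_files = glob.glob(prefix + "*.amalgkit.fastq.gz.safely_removed")
    if fastq_files:
        return "present"
    if flag_files:
        # quant has finished and the fastq was deleted on purpose: not an anomaly.
        return "safely_removed"
    return "missing"
```

Only the `"missing"` result would then count as an anomaly; the read layout (single or paired) no longer matters for recognizing the flag.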

kfuku52 commented 2 years ago

Exactly!

Hego-CCTB commented 2 years ago

Okay, I'm on it!

Hego-CCTB commented 2 years ago

This is the new output:


amalgkit sanity: start
reading metadata from: /Users/s229181/Desktop/metadata/metadata/safely_remove_test.tsv
Checking essential entries from metadata file.
1  species detected:
['Populus trichocarpa']
4  SRA runs detected:
['ERR4131602' 'ERR4131603' 'ERR4131604' 'ERR4131605']
checking for getfastq outputs: 
amalgkit getfastq output folder detected. Checking presence of output files.

Looking for  ERR4131605
found:  ['./getfastq/ERR4131605/ERR4131605.amalgkit.fastq.gz.safely_removed']
fastq files for  ERR4131605  were safely removed by amalgkit quant.
checking for updated metadata in:  ./metadata/updated_metadata/metadata_ERR4131605.tsv
found updated metadata!

Looking for  ERR4131604
found:  ['./getfastq/ERR4131604/ERR4131604.amalgkit.fastq.gz.safely_removed']
fastq files for  ERR4131604  were safely removed by amalgkit quant.
checking for updated metadata in:  ./metadata/updated_metadata/metadata_ERR4131604.tsv
found updated metadata!

Looking for  ERR4131603
found:  ['./getfastq/ERR4131603/ERR4131603.amalgkit.fastq.gz.safely_removed']
fastq files for  ERR4131603  were safely removed by amalgkit quant.
checking for updated metadata in:  ./metadata/updated_metadata/metadata_ERR4131603.tsv
found updated metadata!

Looking for  ERR4131602
found:  ['./getfastq/ERR4131602/ERR4131602.amalgkit.fastq.gz.safely_removed']
fastq files for  ERR4131602  were safely removed by amalgkit quant.
checking for updated metadata in:  ./metadata/updated_metadata/metadata_ERR4131602.tsv
found updated metadata!
Sequences found for all SRA IDs in  /Users/s229181/Desktop/metadata/metadata/safely_remove_test.tsv  !

Looking for Index file ./Index/Populus_trichocarpa* for species:  Populus trichocarpa
Found  ['./Index/Populus_trichocarpa.idx'] !
Index found for all species in  /Users/s229181/Desktop/metadata/metadata/safely_remove_test.tsv  !
checking for quant outputs: 
amalgkit quant output folder detected. Checking presence of output files.

Looking for  ERR4131605
Found output folder  ./quant/ERR4131605  for  ERR4131605
Checking for output files.
./quant/ERR4131605/ERR4131605_abundance.h5  is missing! Please check if quant ran correctly

Looking for  ERR4131604
Found output folder  ./quant/ERR4131604  for  ERR4131604
Checking for output files.
./quant/ERR4131604/ERR4131604_abundance.h5  is missing! Please check if quant ran correctly

Looking for  ERR4131603
Found output folder  ./quant/ERR4131603  for  ERR4131603
Checking for output files.
./quant/ERR4131603/ERR4131603_abundance.h5  is missing! Please check if quant ran correctly

Looking for  ERR4131602
Found output folder  ./quant/ERR4131602  for  ERR4131602
Checking for output files.
./quant/ERR4131602/ERR4131602_abundance.h5  is missing! Please check if quant ran correctly
writing SRA IDs without quant output to:  ./sanity/SRA_IDs_without_quant.txt
Time elapsed: 0 sec
amalgkit sanity: end
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/ERR4131605
./getfastq/ERR4131605/ERR4131605.amalgkit.fastq.gz.safely_removed
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/ERR4131604
./getfastq/ERR4131604/ERR4131604.amalgkit.fastq.gz.safely_removed
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/ERR4131603
./getfastq/ERR4131603/ERR4131603.amalgkit.fastq.gz.safely_removed
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/ERR4131602
./getfastq/ERR4131602/ERR4131602.amalgkit.fastq.gz.safely_removed

Process finished with exit code 0

Hego-CCTB commented 2 years ago

Because get_newest_intermediate_file_extension writes the safely_removed message to stderr, that message is printed at the very end of the output. Do we need to output

getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/ERR4131605
./getfastq/ERR4131605/ERR4131605.amalgkit.fastq.gz.safely_removed

into stderr? It's not really an error.

Hego-CCTB commented 2 years ago

Also, unrelated: Newer versions of kallisto don't produce .h5 output files anymore. Should I update sanity to be kallisto-version-specific, or just stop looking for .h5 files altogether?

kfuku52 commented 2 years ago

We don't need that stderr. You should hack get_newest_intermediate_file_extension() to be compatible with sanity or create a new function.
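
As a rough illustration of the "new function" option, the sketch below reports the flag as an informational note on stdout and returns a boolean for the caller. stdout and stderr are separate streams, so their relative order in a combined console is not guaranteed. `note_safely_removed` and its parameters are hypothetical and are not part of amalgkit.

```python
import sys
from typing import List, Optional, TextIO


def note_safely_removed(
    sra_id: str,
    flag_paths: List[str],
    stream: Optional[TextIO] = None,
) -> bool:
    """Report safely_removed flags as information and tell the caller whether any exist."""
    # Default to stdout because a detected flag is a status note, not an error.
    out = sys.stdout if stream is None else stream
    if not flag_paths:
        return False
    print(f"fastq files for {sra_id} were safely removed by amalgkit quant.", file=out)
    for path in flag_paths:
        print(path, file=out)
    return True
```

Keeping the stream as a parameter would let other commands still send the same message to stderr if they prefer, while sanity keeps everything in one ordered stdout log.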

Which version of kallisto are you using? .h5 was produced with 0.46.2 and I thought that was the latest.

kfuku52 commented 2 years ago

OK, it seems like you didn't have HDF5. Did you compile kallisto manually? https://github.com/pachterlab/kallisto/releases/tag/v0.46.2

Hego-CCTB commented 2 years ago

> OK, it seems like you didn't have HDF5. Did you compile kallisto manually? https://github.com/pachterlab/kallisto/releases/tag/v0.46.2

Yeah, I was wondering about this as well. I thought this was a conda installation, but I just double-checked and it's a manual installation. I probably missed the HDF5 option.

But in any case, it sounds like HDF5 will be phased out eventually.
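
If sanity should tolerate kallisto builds without HDF5 support, one option is to treat the .h5 file as optional and only require the other outputs. This is a sketch under assumptions: the `_abundance.h5` name comes from the log above, while the `_abundance.tsv` and `_run_info.json` names are assumed by analogy with kallisto's standard `abundance.tsv` and `run_info.json` outputs. `check_quant_outputs` is a hypothetical helper.

```python
import os
from typing import List, Sequence, Tuple

# Assumed required outputs, named by analogy with the .h5 file in the log.
REQUIRED_SUFFIXES = ("_abundance.tsv", "_run_info.json")
# Only written when kallisto was built with HDF5 support.
OPTIONAL_SUFFIXES = ("_abundance.h5",)


def check_quant_outputs(quant_dir: str, sra_id: str) -> Tuple[List[str], List[str]]:
    """Return (missing required files, missing optional files) for one sample."""

    def missing(suffixes: Sequence[str]) -> List[str]:
        paths = [os.path.join(quant_dir, sra_id + suffix) for suffix in suffixes]
        return [path for path in paths if not os.path.isfile(path)]

    return missing(REQUIRED_SUFFIXES), missing(OPTIONAL_SUFFIXES)
```

A sample would then be flagged only when the first list is non-empty; a missing .h5 file could be downgraded to a note.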

Hego-CCTB commented 2 years ago

Alright. Should be fixed in https://github.com/kfuku52/amalgkit/commit/69c46e3020f8ff73d74b61c6fdc37e854e1e7ea2

Tested with both single and paired end libraries:


getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/ERR4131602
./getfastq/ERR4131602/ERR4131602.amalgkit.fastq.gz.safely_removed
checking for updated metadata in:  ./metadata/updated_metadata/metadata_ERR4131602.tsv
found updated metadata!

Looking for  SRR14322310
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: ./getfastq/SRR14322310
./getfastq/SRR14322310/SRR14322310_2.amalgkit.fastq.gz.safely_removed
./getfastq/SRR14322310/SRR14322310_1.amalgkit.fastq.gz.safely_removed
checking for updated metadata in:  ./metadata/updated_metadata/metadata_SRR14322310.tsv
found updated metadata!
Sequences found for all SRA IDs in  /Users/s229181/Desktop/metadata/metadata/safely_remove_test.tsv  !

kfuku52 commented 2 years ago

Another error occurred with the latest version:

Looking for  SRR3581852
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: /lustre7/home/lustre4/kfuku/my_project/nepenthes_gracilis/20211013_RNAseq/amalgkit_out/getfastq/SRR3581852
/lustre7/home/lustre4/kfuku/my_project/nepenthes_gracilis/20211013_RNAseq/amalgkit_out/getfastq/SRR3581852/SRR3581852.amalgkit.fastq.gz.safely_removed
getfastq output could not be found in: /lustre7/home/lustre4/kfuku/my_project/nepenthes_gracilis/20211013_RNAseq/amalgkit_out/getfastq/SRR3581852, layout = single
Traceback (most recent call last):
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 378, in <module>
    args.handler(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/bin/amalgkit", line 81, in command_sanity
    sanity_main(args)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/sanity.py", line 234, in sanity_main
    check_getfastq_outputs(args, sra_ids, metadata, output_dir)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/sanity.py", line 58, in check_getfastq_outputs
    ext = get_newest_intermediate_file_extension(sra_stat, sra_path)
  File "/home/kfuku/.pyenv/versions/miniconda3-4.3.30/lib/python3.8/site-packages/amalgkit/util.py", line 67, in get_newest_intermediate_file_extension
    return ext_out
UnboundLocalError: local variable 'ext_out' referenced before assignment

Hego-CCTB commented 2 years ago

return ext_out shouldn't be on line 67 anymore.
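
For context, an UnboundLocalError like the one in the traceback above typically appears when a variable is assigned only inside a conditional branch and then returned unconditionally. The sketch below shows the usual guard of binding the name before the loop. It is not the actual util.py code; `newest_extension` and the candidate extension list are hypothetical.

```python
from typing import List, Optional


def newest_extension(files: List[str], candidates: List[str]) -> Optional[str]:
    """Return the first candidate extension that any file ends with, else None."""
    # Bind the name up front so the final return is valid even when nothing matches,
    # e.g. when only *.safely_removed flag files are left in the directory.
    ext_out: Optional[str] = None
    for ext in candidates:
        if any(name.endswith(ext) for name in files):
            ext_out = ext
            break
    return ext_out


candidates = [".amalgkit.fastq.gz", ".fastq.gz"]
print(newest_extension(["X.amalgkit.fastq.gz.safely_removed"], candidates))
print(newest_extension(["X.fastq.gz"], candidates))
```

Returning `None` leaves it to the caller to decide whether a missing extension is an anomaly or simply the result of a safely removed fastq.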

Ah. I forgot to increase the version. Maybe it didn't properly update on your end, because it didn't find a new version.

Hego-CCTB commented 2 years ago

I committed the new __init__.py

kfuku52 commented 2 years ago

sanity worked, thank you!