Closed asuccurro closed 2 years ago
Hi Antonella, can you add the contents of processdata.sh? This error is thrown when an input file cannot be opened.
It's the file from GitHub (unmodified):
#! /bin/bash
# Cleanup the working directory and ensure that it exists..
rm -Rf tmp
mkdir tmp
# Clean the input data before merging..
python csv-clean.py \
-i input/wggc2019_round1.csv \
-o tmp/wggc2019_round1.clean.csv \
-k confirm-or-change-email-address \
-c profiles-rns \
2> tmp/wggc2019_round1.dups.csv
python csv-clean.py \
-i input/wggc2019_round2.csv \
-o tmp/wggc2019_round2.clean.csv \
-k confirm-or-change-email-address \
-c profiles-rns \
2> tmp/wggc2019_round2.dups.csv
python csv-clean.py \
-i input/ngscn2021_full_form_20211117.csv \
-o tmp/ngscn2021_full_form_20211117.clean.csv \
-k confirm-or-change-email-address \
-c profiles-rns \
2> tmp/ngscn2021_full_form_20211117.dups.csv
python csv-clean.py \
-i input/ngscn2021_reduced_form_20211117.csv \
-o tmp/ngscn2021_reduced_form_20211117.clean.csv \
-k confirm-or-change-email-address \
2> tmp/ngscn2021_reduced_form_20211117.dups.csv
# Merge the input data files..
python csv-merge.py \
-ia tmp/wggc2019_round1.clean.csv \
-ib tmp/wggc2019_round2.clean.csv \
-o tmp/wggc2019.merge.csv \
-k confirm-or-change-email-address
python csv-filter.py \
-i tmp/wggc2019.merge.csv \
-o tmp/wggc2019.filter.csv \
-f input/wggc2019_curated_profiles_emails.txt \
-k confirm-or-change-email-address
python csv-merge.py \
-ia tmp/ngscn2021_full_form_20211117.clean.csv \
-ib tmp/ngscn2021_reduced_form_20211117.clean.csv \
-o tmp/ngscn2021.merge.csv \
-k confirm-or-change-email-address \
-d ngs-cc-affiliation:'WGGC' \
-d hide-contact-information:False \
-m buidling:building \
-m primary-affiliation:affiliation \
-m specify-affiliation-1:specify-affiliation \
-m your-role-s-at-the-wggc:your-role-s-at-the-ngs-cc \
-m specify-your-role-s-at-the-wggc:specify-your-role-s \
-m specify-city:city \
-m you-agree-to-publish-your-data-on-the-ngs-cn-profiles-web-portal-if-affiliated-with-wggc-also-on-the-wggc-profiles-web-portal:you-agree-to-our-terms-and-policy
python csv-merge.py \
-ia tmp/ngscn2021.merge.csv \
-ib tmp/wggc2019.filter.csv \
-o tmp/ngscn2021.merge.1.csv \
-k confirm-or-change-email-address \
-d ngs-cc-affiliation:'WGGC' \
-d hide-contact-information:False \
-m buidling:building \
-m primary-affiliation:affiliation \
-m specify-affiliation-1:specify-affiliation \
-m your-role-s-at-the-wggc:your-role-s-at-the-ngs-cc \
-m specify-your-role-s-at-the-wggc:specify-your-role-s \
-m specify-city:city \
-m you-agree-to-publish-your-data-on-the-ngs-cn-profiles-web-portal-if-affiliated-with-wggc-also-on-the-wggc-profiles-web-portal:you-agree-to-our-terms-and-policy
python csv-clean.py \
-i tmp/ngscn2021.merge.1.csv \
-o tmp/ngscn2021.clean.1.csv \
-k confirm-or-change-email-address \
-c profiles-rns
# Apply filters..
python csv-filter.py \
-i tmp/ngscn2021.clean.1.csv \
-o tmp/ngscn2021.csv \
-k confirm-or-change-email-address \
-f input/remove_from_webportal.txt \
-e true \
-v first-name: \
-v last-name:
# Convert to ProfilesRNS data..
python csv-convert.py \
-i tmp/ngscn2021.csv \
-o output
python csv-keyphrases.py \
-i tmp/ngscn2021.csv \
-o output/keyphrases.txt \
-k affiliation-strings-for-searches-in-pubmed
It breaks here:
(base) succurro@antigone:~/repositories/publicgithub/limes/src/Utils$ python csv-merge.py \
-ia tmp/ngscn2021_full_form_20211117.clean.csv \
-ib tmp/ngscn2021_reduced_form_20211117.clean.csv \
-o tmp/ngscn2021.merge.csv \
-k confirm-or-change-email-address \
-d ngs-cc-affiliation:'WGGC' \
-d hide-contact-information:False \
-m buidling:building \
-m primary-affiliation:affiliation \
-m specify-affiliation-1:specify-affiliation \
-m your-role-s-at-the-wggc:your-role-s-at-the-ngs-cc \
-m specify-your-role-s-at-the-wggc:specify-your-role-s \
-m specify-city:city \
-m you-agree-to-publish-your-data-on-the-ngs-cn-profiles-web-portal-if-affiliated-with-wggc-also-on-the-wggc-profiles-web-portal:you-agree-to-our-terms-and-policy
Exception ignored in: <function CsvReader.__del__ at 0x7fdb3544e510>
Traceback (most recent call last):
File "/home/succurro/repositories/publicgithub/limes/src/Utils/lib/csv_core.py", line 49, in __del__
self.file.close()
AttributeError: 'CsvReader' object has no attribute 'file'
(base) succurro@antigone:~/repositories/publicgithub/limes/src/Utils$ ll tmp/ngscn2021_full_form_20211117.clean.csv
ls: cannot access 'tmp/ngscn2021_full_form_20211117.clean.csv': No such file or directory
(base) succurro@antigone:~/repositories/publicgithub/limes/src/Utils$ ll tmp/ngscn2021_reduced_form_20211117.clean.csv
ls: cannot access 'tmp/ngscn2021_reduced_form_20211117.clean.csv': No such file or directory
Going back to the log when those files should have been produced:
(base) succurro@antigone:~/repositories/publicgithub/limes/src/Utils$ more tmp/ngscn2021_full_form_20211117.dups.csv
Traceback (most recent call last):
File "csv-clean.py", line 16, in <module>
reader = CsvReader(args.input, args.key)
File "/home/succurro/repositories/publicgithub/limes/src/Utils/lib/csv_core.py", line 46, in __init__
raise Exception(f'Key column {keyfield} not found in file {filepath}')
Exception: Key column confirm-or-change-email-address not found in file input/ngscn2021_full_form_20211117.csv
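The `AttributeError` in `__del__` looks like a secondary symptom. It is consistent with `__init__` failing before `self.file` was assigned, so the destructor then trips over the missing attribute and hides the real cause. Below is a minimal sketch of a reader that avoids that secondary error. `SafeCsvReader` is a hypothetical stand-in, not the actual `CsvReader` from `lib/csv_core.py` (which is not shown in this thread), and it assumes a comma-delimited file with a header row.

```python
import csv


class SafeCsvReader:
    """Hypothetical stand-in for CsvReader; not the code in lib/csv_core.py."""

    def __init__(self, filepath, keyfield):
        # Assign the attribute first so __del__ is safe even if open() fails.
        self.file = None
        self.file = open(filepath, newline='')
        self.reader = csv.DictReader(self.file)
        # fieldnames is None for an empty file, hence the fallback to [].
        if keyfield not in (self.reader.fieldnames or []):
            raise ValueError(
                f'Key column {keyfield} not found in file {filepath}')

    def __del__(self):
        # getattr guards against a partially constructed object.
        f = getattr(self, 'file', None)
        if f is not None:
            f.close()
```

With this shape, a missing input file surfaces as a plain `FileNotFoundError`, and a wrong key column as the `Key column ... not found` message, without the extra `Exception ignored in ... __del__` noise.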
Hi Antonella,
which Python version are you using?
Copied the wrong output; see the edited error above. In the base env I run Python 3.7.3:
(base) succurro@antigone:~/repositories/publicgithub/limes/src/Utils$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Would you be available for a quick call?
Found it: some files have "confirm-or-change-email-address" and some have "confirm-or-change-your-email-address". Fixed with:
(base) succurro@antigone:~/repositories/publicgithub/limes/src/Utils/input$ sed -i 's/confirm-or-change-your-email-address/confirm-or-change-email-address/g' *.csv
Closing this issue
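For anyone hitting the same mismatch, a quick pre-flight check can list which input files lack the expected key column before the pipeline runs. This is only a sketch: `files_missing_key` is a hypothetical helper, and it assumes comma-delimited files whose first row is the header.

```python
import csv
from pathlib import Path

# Key column name used throughout processdata.sh.
KEY = 'confirm-or-change-email-address'


def files_missing_key(input_dir, key=KEY):
    """Return the names of CSV files in input_dir whose header lacks `key`."""
    missing = []
    for path in sorted(Path(input_dir).glob('*.csv')):
        with open(path, newline='') as fh:
            # Read only the first row; an empty file yields an empty header.
            header = next(csv.reader(fh), [])
        if key not in header:
            missing.append(path.name)
    return missing
```

Calling `files_missing_key('input')` from the `Utils` directory would then report the files that still need the header fix.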
Yes. Just noticed that the error must have something to do with the column name. Excellent. Good job! :)
However, error handling could be improved. We should add a try/except block around the main loop and pretty-print the error so the actual message does not get lost in the bulk of stack traces.
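A rough sketch of what that could look like, assuming each script wraps its work in a function (called `main` here, which is a hypothetical name; the scripts' actual structure is not shown in this thread):

```python
import sys


def run_main(main):
    """Call main() and print a one-line error instead of a full traceback."""
    try:
        main()
    except Exception as exc:
        # Keep only the exception type and message, e.g.
        # "ERROR: Exception: Key column ... not found in file ...".
        print(f'ERROR: {type(exc).__name__}: {exc}', file=sys.stderr)
        return 1
    return 0
```

Each script would end with `sys.exit(run_main(main))`. The non-zero exit code matters as much as the message: in `processdata.sh` the stderr of `csv-clean.py` is redirected into the `.dups.csv` files, so the error text alone stays hidden there, whereas an exit code lets the calling script stop at the failing step (for example with `set -e`).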
Actually, it would be quite nice to refactor the bash script into a Python script that can be step-debugged.
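As a first step in that direction, the commands from `processdata.sh` could be listed in Python and run one by one, stopping at the first failure. This is a sketch only: `run_steps` and `STEPS` are hypothetical names, only one step is shown, and it assumes each script exits with a non-zero code when it fails. Real step debugging would additionally need the scripts' logic to be importable as functions rather than launched as subprocesses.

```python
import subprocess
import sys

# One step copied from processdata.sh; the remaining steps would follow
# the same pattern.
STEPS = [
    [sys.executable, 'csv-merge.py',
     '-ia', 'tmp/wggc2019_round1.clean.csv',
     '-ib', 'tmp/wggc2019_round2.clean.csv',
     '-o', 'tmp/wggc2019.merge.csv',
     '-k', 'confirm-or-change-email-address'],
]


def run_steps(steps):
    """Run each command in order; stop at the first non-zero exit code."""
    for cmd in steps:
        print('Running:', ' '.join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f'Step failed with exit code {result.returncode}',
                  file=sys.stderr)
            return result.returncode
    return 0
```

Running `sys.exit(run_steps(STEPS))` from the `Utils` directory would then halt at the first broken step instead of continuing with missing intermediate files.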
Error when running the cleaning scripts on the input csv files:
Conda base environment (Python 3.7.3 (default, Mar 27 2019, 22:11:17) )
input files: