nemoarchive / analytics

Repository for the NeMO Analytics project.
MIT License

Set up cron to run uploader #90

Closed apaala closed 4 years ago

apaala commented 4 years ago

Set up the cron job once it is confirmed that we are good to go and the log-file component is ready.

apaala commented 4 years ago

This is blocked because the upload-to-Google-Cloud functionality is not working. Log file processing has been implemented, but I am unable to test start to finish.

apaala commented 4 years ago

@jorvis

Here is the command I am using; maybe it will work for you:

/usr/local/common/Python-3.7.2/bin/python3 /local/projects-t2/achatterjee/analytics/cron_uploader/nemo_upload_crawler.py -ilb /local/scratch/achatterjee/NEMO/Converter/2IN/ -ob /local/scratch/achatterjee/NEMO/Converter/2Out/

The error I get when I try to do the upload:

INFO: Uploading these files to the cloud bucket: /local/scratch/achatterjee/NEMO/Converter/2Out/6d6a645f-4493-4115-b7a2-833b4caa26fb.h5ad, /local/scratch/achatterjee/NEMO/Converter/2Out//6d6a645f-4493-4115-b7a2-833b4caa26fb.json
ERROR: Failed to process file: /local/scratch/achatterjee/NEMO/BrainSpanBulkDevo.tar.gz

The log file lives here: /local/projects-t3/NEMO/cron_upload_log/
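The doubled slash in the INFO line (.../2Out//6d6a645f-...json) suggests the output base was concatenated with a trailing slash still attached. A minimal sketch of a safer join; bundle_paths is a hypothetical helper, not a function in the crawler:

```python
import os

def bundle_paths(output_base, dataset_id):
    """Build the .h5ad/.json output paths for one dataset.

    Hypothetical helper: normalizing the base before os.path.join avoids
    the doubled slash seen when -ob is passed with a trailing '/'.
    """
    base = output_base.rstrip("/") or "/"
    h5ad = os.path.join(base, dataset_id + ".h5ad")
    meta = os.path.join(base, dataset_id + ".json")
    return h5ad, meta
```

The doubled slash is harmless to most filesystem calls, but it makes log grepping and path comparisons unreliable, which matters for a cron job whose runs are audited after the fact.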

apaala commented 4 years ago

The cron shell script that calls the uploader is in place here: /local/projects-t3/NEMO/cron_upload_log/. We need to decide where the output files (h5ad) will be stored before uploading to the server.

apaala commented 4 years ago

@jorvis I am not sure whether this is ready to be put in place yet; are we in a position to set up the cron? @carlocolantuoni was asking about some datasets he needs uploaded, and I am trying to figure out whether I should just run it manually or wait for the cron...

jorvis commented 4 years ago

As long as logging is in place where we can undo operations as needed, it's fine to install the cron.
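The condition above (logging sufficient to undo operations) could be met with an append-only upload log that records every object pushed to the bucket. A minimal sketch under assumed names; record_upload and the TSV layout are hypothetical, not part of the crawler:

```python
import csv
import datetime
import os

def record_upload(dataset_id, bucket_path, log_path):
    """Append one row per uploaded object so an operator can later find
    and delete (undo) it in the cloud bucket.

    Hypothetical format: timestamp <TAB> dataset_id <TAB> bucket_path.
    """
    write_header = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        if write_header:
            writer.writerow(["timestamp", "dataset_id", "bucket_path"])
        writer.writerow([
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            dataset_id,
            bucket_path,
        ])
```

An append-only file (opened with mode "a") is a deliberate choice here: a crashed cron run can never truncate the history of earlier uploads.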

apaala commented 4 years ago

@jorvis ok I will talk to @victor73 about getting it in place.

apaala commented 4 years ago

@carlocolantuoni I have set up the cron; it will run at 2:30 am every day. I will check the logs tomorrow to see whether it worked and will update here.

carlocolantuoni commented 4 years ago

great - thanks!


apaala commented 4 years ago

Got an error when the command set up in cron tried to run:

Traceback (most recent call last):
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2656, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 311, in <module>
    main()
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 71, in main
    files_pending = get_datasets_to_process(args.input_log_base, args.output_base, processed_logfile)
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 223, in get_datasets_to_process
    hold_relevant_entries = read_log_file.loc[read_log_file['Type'].isin(formats)]
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2658, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Type'

@jorvis @carlocolantuoni Got this error. Not sure what the problem was... It looks like it could not find the 'Type' header in the log file, but the file seems to have it.
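The KeyError: 'Type' means pandas parsed a log whose first row was not the expected header (consistent with the later finding that a differently formatted log file was being picked up). A sketch of the crawler's filter step with a fail-fast column check; relevant_entries, the tab separator, and the default format list are assumptions based on the traceback, not the actual code:

```python
import pandas as pd

def relevant_entries(log_path, formats=("MEX", "H5AD")):
    """Read a bundle log and keep rows whose Type is in `formats`.

    Hypothetical reimplementation of get_datasets_to_process's filter:
    instead of a bare read_log_file['Type'] (which raises an opaque
    KeyError when the header is absent), report which file was bad and
    what columns it actually had.
    """
    df = pd.read_csv(log_path, sep="\t")  # tab-separated is an assumption
    if "Type" not in df.columns:
        raise ValueError(
            f"{log_path}: no 'Type' column; found {list(df.columns)} -- "
            "is this a *diff.log file with the expected header?")
    return df.loc[df["Type"].isin(formats)]
```

Surfacing the offending filename in the exception would have shortened this debugging round considerably, since the cron run touches many log files.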

apaala commented 4 years ago

The cron error was:

/local/projects-t3/NEMO/bin/cron/uploader.sh: line 3: gcloud: command not found
Traceback (most recent call last):
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 37, in <module>
    from gear.dataarchive import DataArchive
  File "/local/projects-t3/NEMO/bin/cron/gEAR/lib/gear/dataarchive.py", line 5, in <module>
    import scanpy.api as sc
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/scanpy/__init__.py", line 31, in <module>
    from . import tools as tl
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/scanpy/tools/__init__.py", line 12, in <module>
    from ._sim import sim
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/scanpy/tools/_sim.py", line 19, in <module>
    from .. import readwrite
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/scanpy/readwrite.py", line 9, in <module>
    import tables
  File "/usr/local/common/Python-3.7.2/lib/python3.7/site-packages/tables/__init__.py", line 93, in <module>
    from .utilsextension import (
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /usr/local/common/Python-3.7.2/lib/python3.7/site-packages/tables/utilsextension.cpython-37m-x86_64-linux-gnu.so)
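The first line of that output ("gcloud: command not found") is a classic cron symptom: cron jobs run with a minimal PATH, so binaries visible in an interactive shell are not found. One way to make the uploader robust is to resolve the binary explicitly; find_gcloud, run_gcloud, and the fallback install locations below are hypothetical, not taken from the crawler:

```python
import os
import shutil
import subprocess

# Assumed fallback install locations -- adjust for the actual host.
FALLBACKS = (
    "/usr/local/google-cloud-sdk/bin/gcloud",
    "/opt/google-cloud-sdk/bin/gcloud",
)

def find_gcloud(fallbacks=FALLBACKS):
    """Resolve the gcloud binary without relying on cron's minimal PATH."""
    found = shutil.which("gcloud")
    if found:
        return found
    for path in fallbacks:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None

def run_gcloud(args):
    """Run gcloud with an explicit path, failing with a clear message."""
    gcloud = find_gcloud()
    if gcloud is None:
        raise FileNotFoundError(
            "gcloud not found; export a full PATH in the cron wrapper script")
    return subprocess.run([gcloud, *args], check=True)
```

The simpler alternative, of course, is to export a full PATH at the top of uploader.sh itself; the sketch above just keeps the Python side self-sufficient either way. The CXXABI_1.3.9 ImportError is separate: the pytables C++ extension was built against a newer libstdc++ than the system's /lib64/libstdc++.so.6 provides, which is the environment issue referenced later in the thread.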

apaala commented 4 years ago

@jorvis Resolved the first error by adding *diff.log as the identifying string; a log file in a different format was throwing the code off. Now processing stops because it cannot find a file, which may have been moved.

INFO: Extracting dataset at path: /local/projects-t3/NEMO/dmz/brain/biccn/grant/devhu/transcriptome/scell/processed/counts/GW22_somato2/GW22_somato2.mex.tar.gz
Traceback (most recent call last):
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 311, in <module>
    main()
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 76, in main
    dataset_dir = extract_dataset(file_path, args.output_base)
  File "/local/projects-t3/NEMO/bin/cron/analytics/cron_uploader/nemo_upload_crawler.py", line 173, in extract_dataset
    tar = tarfile.open(input_file_path)
  File "/usr/local/common/Python-3.7.2/lib/python3.7/tarfile.py", line 1573, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/local/common/Python-3.7.2/lib/python3.7/tarfile.py", line 1638, in gzopen
    fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/usr/local/common/Python-3.7.2/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/local/projects-t3/NEMO/dmz/brain/biccn/grant/devhu/transcriptome/scell/processed/counts/GW22_somato2/GW22_somato2.mex.tar.gz'
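Because the dmz area can drift out of sync with the logs being replayed, one missing bundle currently kills the entire cron run. A sketch of how the crawler's extract_dataset (line 173 in the traceback) could guard against relocated files; the body below is an assumed reconstruction, only the function name and the tarfile.open call are confirmed by the traceback:

```python
import os
import tarfile

def extract_dataset(input_file_path, output_base):
    """Extract a dataset bundle, skipping bundles whose file has moved.

    Assumed behavior: return the extraction directory on success, or
    None (after logging a warning) when the bundle no longer exists, so
    the cron run can continue with the remaining datasets.
    """
    if not os.path.isfile(input_file_path):
        print(f"WARN: skipping missing bundle: {input_file_path}")
        return None
    # Name the extraction dir after the bundle, e.g. GW22_somato2.mex.tar.gz
    # -> <output_base>/GW22_somato2 (naming scheme is an assumption).
    dataset_name = os.path.basename(input_file_path).split(".")[0]
    dataset_dir = os.path.join(output_base, dataset_name)
    os.makedirs(dataset_dir, exist_ok=True)
    with tarfile.open(input_file_path) as tar:
        tar.extractall(path=dataset_dir)
    return dataset_dir
```

Skipped bundles would still need to be reported (e.g. in the processed-log bookkeeping) so they are retried or investigated rather than silently dropped.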

carlocolantuoni commented 4 years ago

I didn't move any files ... I don't know how that happened ... ?


apaala commented 4 years ago

@carlocolantuoni it may not be your files that got moved... We are trying to process all existing datasets through the cron; once that is done, it will only process newly added datasets.

carlocolantuoni commented 4 years ago

Hey apaala, could you let me know when the cron uploader is working? Thanks!

apaala commented 4 years ago

The cron error was the same as before: gcloud: command not found, followed by the ImportError for CXXABI_1.3.9 from the pytables extension.

@adkinsrs This error is persisting; @victor73 said it seems like an environment issue.

adkinsrs commented 4 years ago

Was talking to @apaala about this ticket and #85. Given that the ingest scripts will be updated soon, it may be best to use the .diff files generated in the "dmz" area instead of the bundling output, since that file notes any files that have moved in addition to newly bundled files.

adkinsrs commented 4 years ago

Currently I have not heard of any additional plans to upload NeMO Archive data; we are still only uploading GEO submission data. Closing for now; will reopen if needed.