globaldothealth / list

Repository for Global.health: a data science initiative to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.

Automatic importer: Ohio case data #477

Closed: iamleeg closed this issue 2 years ago

iamleeg commented 4 years ago

https://coronavirus.ohio.gov/static/COVIDSummaryData.csv seems to be offline?

rahul18cracker commented 4 years ago

Hi

This seems to be static data. Do you want me to pull the CSV file from the link every time the parser runs? Then I would be able to open the file and extract the data.

Thanks ~Rahul

iamleeg commented 4 years ago

There's a separate fetcher function that downloads the data to S3. Your parser just needs to pick up the file from S3 and interpret the content: if the file doesn't update, the fetcher won't trigger it more than once.
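For illustration only, here is a rough sketch of the S3 pickup side of a parser (the function name and key layout are hypothetical, not the project's actual parser interface):

import csv
import boto3

def parse_source_file(bucket: str, key: str):
    # Pick up the file the fetcher stored in S3 and iterate over its rows.
    # The bucket/key come from the retrieval step; nothing here re-downloads
    # the source URL itself.
    local_path = "/tmp/source_data.csv"
    boto3.client("s3").download_file(bucket, key, local_path)
    with open(local_path, newline="") as f:
        for row in csv.DictReader(f):
            yield row  # interpreting the columns is the parser's job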

rahul18cracker commented 4 years ago

Hi Graham

Sorry, I was not able to work on this earlier; now I have time. I went through the videos from Alex explaining the Japan case data and how he copies the JSON file manually to the S3 bucket. What you are saying above is that this would be done by the fetcher function. Can you please give me some examples where I can see this fetcher function at work, how to trigger it, and how to see the result in the S3 bucket?

I am also planning to take on the URL for Brazil: https://extranet.saude.go.gov.br/pentaho/api/repo/:coronavirus:paineis:painel.wcdf/generatedContent

Will a URL-based source also work with the fetcher? Basically I am very new to this and I didn't find examples showing how the fetcher extracts data and how I get data out of it. I watched Alex's Japan case data video multiple times to figure it out but was not successful.

Thanks ~Rahul

iamleeg commented 4 years ago

@rahul18cracker this file documents the retrieval function and how to invoke it, and also how to set up the data source in your dev stack.

rahul18cracker commented 4 years ago

Hi @iamleeg: I was able to set things up locally to the point where, when I do the first sam build, it shows me this error. Now I am not sure which file it is talking about. I am following the Mac laptop guidelines. My JSON file looks like this; is it talking about the JSON file or something else?

File: valid_scheduled_event.json

{ "env": "local", "sourceId": "5f589b22d9a72c0028ec668b" }
"auth": { "email": "local@ingestion.function" }
{ "env": "brazil", "sourceId": "5f67dc50ace517002f421ebe" }

Error:

{"errorType":"Runtime.UnmarshalError","errorMessage":"Unable to unmarshal input: Extra data: line 5 column 1 - line 11 column 2 (char 67 - 185)"}

Complete logs:

(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$ sam suild
2020-09-20 16:13:18 Command suild not available
Usage: sam [OPTIONS] COMMAND [ARGS]...
Try 'sam --help' for help.

Error: No such command 'suild'.

(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$ sam build
Building function 'RetrievalFunction'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource
Building function 'IndiaParsingFunction'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource
Building function 'HongKongParsingFunction'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource
Building function 'JapanParsingFunction'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource
Building function 'CHZurichParsingFunction'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource
Building layer 'ParsingLibLayer'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource
Building layer 'CommonLibLayer'
Running PythonPipBuilder:ResolveDependencies
Running PythonPipBuilder:CopySource

Build Succeeded

Built Artifacts : .aws-sam/build
Built Template : .aws-sam/build/template.yaml

Commands you can use next

[] Invoke Function: sam local invoke
[] Deploy: sam deploy --guided

(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$ sam local invoke "RetrievalFunction" -e retrieval/valid_scheduled_event.json --docker-network=host
Invoking retrieval.lambda_handler (python3.8)
CommonLibLayer is a local Layer in the template
Image was not found.
Building image.....
Skip pulling image and use local one: samcli/lambda:python3.8-9576495e2926b8d9e213a58a6.

Mounting /Users/rahmathu/Documents/personal_projects/open_covid_work/list/ingestion/functions/.aws-sam/build/RetrievalFunction as /var/task:ro,delegated inside runtime container
START RequestId: 25b44020-3f7a-1533-7939-fd14260c6d51 Version: $LATEST
[ERROR] Runtime.UnmarshalError: Unable to unmarshal input: Extra data: line 5 column 1 - line 11 column 2 (char 67 - 185)
END RequestId: 25b44020-3f7a-1533-7939-fd14260c6d51
REPORT RequestId: 25b44020-3f7a-1533-7939-fd14260c6d51  Init Duration: 3165.00 ms  Duration: 4.14 ms  Billed Duration: 100 ms  Memory Size: 128 MB  Max Memory Used: 47 MB

{"errorType":"Runtime.UnmarshalError","errorMessage":"Unable to unmarshal input: Extra data: line 5 column 1 - line 11 column 2 (char 67 - 185)"}

attwad commented 4 years ago

"line 5 column 1" in the error message leads to an invalid json syntax yes, the file should look like:

{
  "env": "local",
  "sourceId": "5f589b22d9a72c0028ec668b",
  "auth": {
    "email": "local@ingestion.function"
  }
}

also "env": "brazil" doesn't mean anything, envs are either local, prod or dev. Was that somewhere in the docs?

rahul18cracker commented 4 years ago

Hi @attwad

Thanks for taking the time to reply to this and suggesting a fix. It was in the docs: after creating the email, it said to generate an ID and put that in the JSON file. I named the source "brazil", took the ID "5f67dc50ace517002f421ebe", and was trying to add it to the JSON file.

From the README:

"Go to the UI at http://localhost:3002/sources and add a new source for your parser, once you give it a name, a URL and save it, it will be given an ID.

Put that ID in the retrieval/valid_scheduled_event.json file."

How can I add this new source with "sourceId": "5f67dc50ace517002f421ebe"? If I add it as an extra line, JSON validation fails because two keys with the same name are present. I was not able to find an example of how others added it in the tutorials either.

Thanks ~Rahul

attwad commented 4 years ago

The retrieval/valid_scheduled_event.json is already a "valid" file, as the name implies, so you need to change the existing ID that's in there, not add a new one.
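To illustrate (using the source ID from your earlier comment; substitute whatever ID your local UI generated), the edited file would end up looking something like:

{
  "env": "local",
  "sourceId": "5f67dc50ace517002f421ebe",
  "auth": {
    "email": "local@ingestion.function"
  }
}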

rahul18cracker commented 4 years ago

Thanks a lot, let me do that.

rahul18cracker commented 4 years ago

Hi @attwad

Thanks for all the help above; I am now able to trigger the retrieval for the source. I am now stuck on a new problem: after the retrieval it tries to upload the result to an S3 bucket. When I requested an S3 key and ID from Alex, he mentioned that for local testing I would not need one. From the SAM CLI it looks like it hits an exception when it cannot upload the data to S3 due to the key problem. I am not sure how to get the keys, or how to test in local mode where I could put the file in a temp folder and read it from there (I can modify the code to do that, but I want to stick to the regular path). Can you please let me know how to get around this?

Main error according to me:

Failed to upload /tmp/content.json to epid-sources-raw/5f67dc50ace517002f421ebe/2020/09/28/0444/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
Updating upload via http://localhost:3001/api/sources/5f67dc50ace517002f421ebe/uploads/5f716a0d198eb20044c43d20
[ERROR] S3UploadFailedError: Failed to upload /tmp/content.json to epid-sources-raw/5f67dc50ace517002f421ebe/2020/09/28/0444/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.

Complete logs:

(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$ sam local invoke "RetrievalFunction" -e retrieval/valid_scheduled_event.json --docker-network=host
Invoking retrieval.lambda_handler (python3.8)
CommonLibLayer is a local Layer in the template
Building image...........
Skip pulling image and use local one: samcli/lambda:python3.8-9576495e2926b8d9e213a58a6.

Mounting /Users/rahmathu/Documents/personal_projects/open_covid_work/list/ingestion/functions/.aws-sam/build/RetrievalFunction as /var/task:ro,delegated inside runtime container
START RequestId: b483190d-5bb6-1fec-2661-b8540e790b9c Version: $LATEST
Extracting fields from event {'env': 'local', 'sourceId': '5f67dc50ace517002f421ebe', 'auth': {'email': 'local@ingestion.function'}}
Logging-in user local@ingestion.function
Creating upload via http://localhost:3001/api/sources/5f67dc50ace517002f421ebe/uploads
Requesting source configuration from http://localhost:3001/api/sources/5f67dc50ace517002f421ebe
Received source API response: {'_id': '5f67dc50ace517002f421ebe', 'name': 'Brazil data', 'origin': {'_id': '5f67dc50ace517002f421ebf', 'url': 'https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent'}, 'format': 'JSON', 'uploads': [{'created': '2020-09-28T04:43:57.375Z', '_id': '5f716a0d198eb20044c43d20', 'status': 'IN_PROGRESS', 'summary': {}}], '__v': 1}
Downloading JSON content from https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent
Download finished
detecting encoding of retrieved content.
Source encoding is presumably {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Failed to upload /tmp/content.json to epid-sources-raw/5f67dc50ace517002f421ebe/2020/09/28/0444/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
Updating upload via http://localhost:3001/api/sources/5f67dc50ace517002f421ebe/uploads/5f716a0d198eb20044c43d20
[ERROR] S3UploadFailedError: Failed to upload /tmp/content.json to epid-sources-raw/5f67dc50ace517002f421ebe/2020/09/28/0444/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
Traceback (most recent call last):
  File "/var/task/retrieval.py", line 240, in lambda_handler
    upload_to_s3(file_name, s3_object_key, env,
  File "/var/task/retrieval.py", line 143, in upload_to_s3
    common_lib.complete_with_error(
  File "/opt/python/common_lib.py", line 88, in complete_with_error
    raise exception
  File "/var/task/retrieval.py", line 138, in upload_to_s3
    s3_client.upload_file(
  File "/var/task/boto3/s3/inject.py", line 129, in upload_file
    return transfer.upload_file(
  File "/var/task/boto3/s3/transfer.py", line 285, in upload_file
    raise S3UploadFailedError(
END RequestId: b483190d-5bb6-1fec-2661-b8540e790b9c
REPORT RequestId: b483190d-5bb6-1fec-2661-b8540e790b9c  Init Duration: 3437.48 ms  Duration: 4595.98 ms  Billed Duration: 4600 ms  Memory Size: 128 MB  Max Memory Used: 51 MB

{"errorType":"S3UploadFailedError","errorMessage":"Failed to upload /tmp/content.json to epid-sources-raw/5f67dc50ace517002f421ebe/2020/09/28/0444/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.","stackTrace":[" File \"/var/task/retrieval.py\", line 240, in lambda_handler\n upload_to_s3(file_name, s3_object_key, env,\n"," File \"/var/task/retrieval.py\", line 143, in upload_to_s3\n common_lib.complete_with_error(\n"," File \"/opt/python/common_lib.py\", line 88, in complete_with_error\n raise exception\n"," File \"/var/task/retrieval.py\", line 138, in upload_to_s3\n s3_client.upload_file(\n"," File \"/var/task/boto3/s3/inject.py\", line 129, in upload_file\n return transfer.upload_file(\n"," File \"/var/task/boto3/s3/transfer.py\", line 285, in upload_file\n raise S3UploadFailedError(\n"]}

attwad commented 4 years ago

Hi, local authentication doesn't require access to serialized credentials on S3 anymore, that's right (you're already using the "auth" field in the event, which does just that, good), but storage of retrieved files still requires some S3 access.

I recommend just setting up an AWS project of your own to test the flow end to end; it fits in the free tier (5 GB / 12 months). Or, if you want to add support for local file downloads, that could also work and would be a good future-proof alternative helping other contributors (I filed https://github.com/globaldothealth/list/issues/1236 just now to track this effort).

rahul18cracker commented 4 years ago

HI @attwad

Thanks for the quick answer. Since I am new and still struggling to get the first one working, what I will do, per your suggestion, is set up an S3 bucket using the free tier. Once I am able to set up the bucket and upload and pull data from it, that will let me verify the flow and get things working.

Once that's done I can look at #1236 and see if I am able to help.

Thanks ~Rahul

rahul18cracker commented 4 years ago

Hi @attwad

I tried doing the above, setting up a free S3 bucket as you suggested. I still see the credentials problem when I modify the files to use my bucket "epid-sources-raw-rahul"; I have modified the following files to point to my own bucket.

I tried to go through the code, and in retrieval.py I see s3_client = boto3.client("s3"). I tried hunting for where this file gets its credentials, and it looks like they come from .yml files. I am not able to work out why the credentials do not work in the SAM-CLI-invoked functions but work fine with the test code I wrote. Please see all the details below.

In retrieval.py:
OUTPUT_BUCKET = "epid-sources-raw-rahul" (previously OUTPUT_BUCKET = "epid-sources-raw")

In ingestion/functions/template.yaml, line 16:
BucketName: epid-sources-raw-rahul (previously BucketName: epid-sources-raw)

I have updated my .env file with the following (please note that for security reasons I have omitted the real keys and secrets):

(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$ cat ../../dev/.env
AWS_ACCESS_KEY_ID=my-aws-iam-user-key
AWS_SECRET_ACCESS_KEY=my-aws-iam-user-secret-key

So then I tried a small program that just accesses the S3 bucket, and it works with the credentials I provided.

Code:

import boto3

# Placeholder credentials; real values omitted for security.
s3 = boto3.client(
    's3',
    aws_access_key_id='my-aws-iam-user-key',
    aws_secret_access_key='my-aws-iam-user-secret-key',
    region_name='us-east-1',
)
response = s3.list_buckets()

print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

Output of the above code:

/Users/rahmathu/Documents/personal_projects/code-test/venv/bin/python /Users/rahmathu/Documents/personal_projects/code-test/test-code.py
Existing buckets:
  epid-sources-raw-rahul

Process finished with exit code 0

Error log I see when I execute the SAM retrieval:

(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$ sam local invoke "RetrievalFunction" -e retrieval/valid_scheduled_event.json --docker-network=host
Invoking retrieval.lambda_handler (python3.8)
CommonLibLayer is a local Layer in the template
Building image.............
Skip pulling image and use local one: samcli/lambda:python3.8-9576495e2926b8d9e213a58a6.

Mounting /Users/rahmathu/Documents/personal_projects/open_covid_work/list/ingestion/functions/.aws-sam/build/RetrievalFunction as /var/task:ro,delegated inside runtime container
START RequestId: 71cb006d-2007-1b3b-a1f2-54d51a675cd9 Version: $LATEST
Extracting fields from event {'env': 'local', 'sourceId': '5f67dc50ace517002f421ebe', 'auth': {'email': 'local@ingestion.function'}}
Logging-in user local@ingestion.function
Creating upload via http://localhost:3001/api/sources/5f67dc50ace517002f421ebe/uploads
Requesting source configuration from http://localhost:3001/api/sources/5f67dc50ace517002f421ebe
Received source API response: {'_id': '5f67dc50ace517002f421ebe', 'name': 'Brazil data', 'origin': {'_id': '5f67dc50ace517002f421ebf', 'url': 'https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent'}, 'format': 'JSON', 'uploads': [{'created': '2020-09-28T04:43:57.375Z', '_id': '5f716a0d198eb20044c43d20', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-04T20:45:53.107Z', '_id': '5f7a3481a13a6c0039d56874', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-04T20:54:08.699Z', '_id': '5f7a3670a13a6c0039d56879', 'status': 'IN_PROGRESS', 'summary': {}}], '__v': 3}
Downloading JSON content from https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent
Download finished
detecting encoding of retrieved content.
Source encoding is presumably {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Failed to upload /tmp/content.json to epid-sources-raw-rahul/5f67dc50ace517002f421ebe/2020/10/04/2054/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
Updating upload via http://localhost:3001/api/sources/5f67dc50ace517002f421ebe/uploads/5f7a3670a13a6c0039d56879
[ERROR] S3UploadFailedError: Failed to upload /tmp/content.json to epid-sources-raw-rahul/5f67dc50ace517002f421ebe/2020/10/04/2054/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
Traceback (most recent call last):
  File "/var/task/retrieval.py", line 240, in lambda_handler
    upload_to_s3(file_name, s3_object_key, env,
  File "/var/task/retrieval.py", line 143, in upload_to_s3
    common_lib.complete_with_error(
  File "/opt/python/common_lib.py", line 88, in complete_with_error
    raise exception
  File "/var/task/retrieval.py", line 138, in upload_to_s3
    s3_client.upload_file(
  File "/var/task/boto3/s3/inject.py", line 129, in upload_file
    return transfer.upload_file(
  File "/var/task/boto3/s3/transfer.py", line 285, in upload_file
    raise S3UploadFailedError(
END RequestId: 71cb006d-2007-1b3b-a1f2-54d51a675cd9
REPORT RequestId: 71cb006d-2007-1b3b-a1f2-54d51a675cd9  Init Duration: 3526.08 ms  Duration: 7439.97 ms  Billed Duration: 7500 ms  Memory Size: 128 MB  Max Memory Used: 52 MB

{"errorType":"S3UploadFailedError","errorMessage":"Failed to upload /tmp/content.json to epid-sources-raw-rahul/5f67dc50ace517002f421ebe/2020/10/04/2054/content.json: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.","stackTrace":[" File \"/var/task/retrieval.py\", line 240, in lambda_handler\n upload_to_s3(file_name, s3_object_key, env,\n"," File \"/var/task/retrieval.py\", line 143, in upload_to_s3\n common_lib.complete_with_error(\n"," File \"/opt/python/common_lib.py\", line 88, in complete_with_error\n raise exception\n"," File \"/var/task/retrieval.py\", line 138, in upload_to_s3\n s3_client.upload_file(\n"," File \"/var/task/boto3/s3/inject.py\", line 129, in upload_file\n return transfer.upload_file(\n"," File \"/var/task/boto3/s3/transfer.py\", line 285, in upload_file\n raise S3UploadFailedError(\n"]}

Thanks ~Rahul

attwad commented 4 years ago

In your small example you are specifying the S3 key ID and access key explicitly, whereas in the retrieval function code it's just using the default S3 client constructor; the difference might come from there. I believe the default constructor uses your local credentials as configured with aws configure. Did you run that command and enter your access key and ID there?
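Roughly, the difference is between these two ways of constructing the client (a sketch with placeholder key values):

import boto3

# The standalone test: credentials are passed explicitly, so they always apply.
explicit = boto3.client(
    "s3",
    aws_access_key_id="my-aws-iam-user-key",             # placeholder
    aws_secret_access_key="my-aws-iam-user-secret-key",  # placeholder
    region_name="us-east-1",
)

# retrieval.py: no credentials are passed, so boto3 falls back to its default
# lookup chain (environment variables, ~/.aws/credentials, instance role, ...).
default = boto3.client("s3")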

rahul18cracker commented 4 years ago

Hi @attwad

Thanks for getting back to me. I have credentials specified in the .env file in the "list/dev" folder, as mentioned in the setup for this project; I thought that when I put my latest credentials for my AWS IAM user there, the code would pick them up from there. It looks like it's taking them from somewhere else which I am not able to figure out. I tried looking for where this AWS config would be, and it should be at ~/.aws/config, but I don't have such a folder; I only have .aws-sam. I was trying to add prints where the code reads the AWS ID and key to see what values it's using, but I was not able to find that anywhere except in the .yml files. If you can point me to the code where it takes these values from the .env file or some other source, I can add some debugging to see which key it's taking and why not the one from the .env file.

(opencovidenv) RAHMATHU-M-D0KK:dev rahmathu$ ls ~/.a*
/Users/rahmathu/.anyconnect

/Users/rahmathu/.anaconda: navigator

/Users/rahmathu/.android: adbkey adbkey.pub

/Users/rahmathu/.astropy: config

/Users/rahmathu/.aws-sam: layers-pkg metadata.json

Thanks ~Rahul

attwad commented 4 years ago

The /dev/.env file is used by Docker when you run the stack locally, but SAM uses a different stack which ignores this env file. So when it talks to Amazon it uses the credentials usually found in ~/.aws/credentials; to generate that directory and its contents, just download the AWS CLI and run aws configure.
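For reference, running aws configure writes something like this to ~/.aws/credentials (placeholder values shown):

[default]
aws_access_key_id = my-aws-iam-user-key
aws_secret_access_key = my-aws-iam-user-secret-key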

rahul18cracker commented 4 years ago

Hi @attwad

Thanks, I was not aware of this; let me try these steps and see if I can get it to work.

Thanks ~Rahul

rahul18cracker commented 4 years ago

Hi @attwad

Thanks for the help, I was able to upload to the S3 bucket I set up now. Would you like me to update the README for this as part of issue #1236?

My successful output

Invoking retrieval.lambda_handler (python3.8)
CommonLibLayer is a local Layer in the template
Building image............
Skip pulling image and use local one: samcli/lambda:python3.8-9576495e2926b8d9e213a58a6.

Mounting /Users/rahmathu/Documents/personal_projects/open_covid_work/list/ingestion/functions/.aws-sam/build/RetrievalFunction as /var/task:ro,delegated inside runtime container
START RequestId: cf6476db-f62e-1490-0bb5-cb043954df25 Version: $LATEST
Extracting fields from event {'env': 'local', 'sourceId': '5f67dc50ace517002f421ebe', 'auth': {'email': 'local@ingestion.function'}}
Logging-in user local@ingestion.function
Creating upload via http://localhost:3001/api/sources/5f67dc50ace517002f421ebe/uploads
Requesting source configuration from http://localhost:3001/api/sources/5f67dc50ace517002f421ebe
Received source API response: {'_id': '5f67dc50ace517002f421ebe', 'name': 'Brazil data', 'origin': {'_id': '5f67dc50ace517002f421ebf', 'url': 'https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent'}, 'format': 'JSON', 'uploads': [{'created': '2020-09-28T04:43:57.375Z', '_id': '5f716a0d198eb20044c43d20', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-04T20:45:53.107Z', '_id': '5f7a3481a13a6c0039d56874', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-04T20:54:08.699Z', '_id': '5f7a3670a13a6c0039d56879', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-10T21:38:28.838Z', '_id': '5f8229d4ea4e6300458a49e9', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-10T21:50:04.601Z', '_id': '5f822c8cea4e6300458a49ee', 'status': 'ERROR', 'summary': {'error': 'INTERNAL_ERROR'}}, {'created': '2020-10-10T22:01:48.634Z', '_id': '5f822f4cea4e6300458a49f3', 'status': 'IN_PROGRESS', 'summary': {}}], '__v': 6}
Downloading JSON content from https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent
Download finished
detecting encoding of retrieved content.
Source encoding is presumably {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Uploaded source content to s3://epid-sources-raw-rahul/5f67dc50ace517002f421ebe/2020/10/10/2201/content.json
END RequestId: cf6476db-f62e-1490-0bb5-cb043954df25
REPORT RequestId: cf6476db-f62e-1490-0bb5-cb043954df25  Init Duration: 3115.39 ms  Duration: 8545.15 ms  Billed Duration: 8600 ms  Memory Size: 128 MB  Max Memory Used: 51 MB

{"bucket":"epid-sources-raw-rahul","key":"5f67dc50ace517002f421ebe/2020/10/10/2201/content.json","upload_id":"5f822f4cea4e6300458a49f3"}
(opencovidenv) RAHMATHU-M-D0KK:functions rahmathu$

Thanks ~Rahul

rahul18cracker commented 4 years ago

Hi @attwad

I got the retrieval working, but it's pulling the HTML page from the website. What I expected was for it to extract the data from the HTML page and pack it into JSON. I am not sure why the ingestion function won't do that; do I have to parse the HTML page using Selenium or BeautifulSoup first?

Thanks ~Rahul

E.g.: I am using this link https://extranet.saude.go.gov.br/pentaho/api/repos/:coronavirus:paineis:painel.wcdf/generatedContent and in the S3 bucket I see the HTML page extracted, not values about cases, in the JSON file:

https://epid-sources-raw-rahul.s3.us-east-2.amazonaws.com/5f67dc50ace517002f421ebe/2020/10/10/2201/content.json

attwad commented 4 years ago

Yes changing the readme with your new function would be great, thank you.

You need a source URL whose content is JSON or CSV, yes; the ingestion function just retrieves the content as-is, it doesn't call out to any other extraction libraries like BeautifulSoup. Also, this issue is about Ohio case data, but you seem to be using a link for Brazil; why is that?

rahul18cracker commented 4 years ago

Hi @attwad

I don't have a function where I have made these changes; let me look at putting this into a function for that issue. I just made local changes to check whether I could get the stack to work.

I was trying to see how to read three different formats. I see that JSON is used in the video examples; here it's CSV, which is also easy with this retrieval function. How do you do it for HTML pages? I have not seen an example of that; can you please point me to one?

I would like to do both this one and the Brazil case.

Thanks ~Rahul

attwad commented 4 years ago

Sources should use structured data whenever possible; parsing HTML is going to be really flaky. If the first link Graham provided is offline now, perhaps Ohio offers a new dataset and we just have to find it. https://github.com/globaldothealth/list/issues/477#issue-654123735

attwad commented 4 years ago

Also, Brazil is already being parsed for some states using structured data, so please keep this issue to its original Ohio case.

rahul18cracker commented 4 years ago

Hi @attwad

I started to look at the CSV file for the Ohio case data and followed the link below to convert the CSV to JSON format, to make sure the ingestion functions can read it. I am a bit stuck again:

https://github.com/globaldothealth/list/blob/main/data-serving/scripts/convert-data/README.md

It asks for a source ID to trigger the specific request and convert the CSV data to JSON format. When I run it on my laptop, the script does not accept the source ID argument mentioned in the README. And even if I trigger it without the argument, it tries to unzip a file latestdata.tar.gz.

Can you please tell me which guide I should follow to convert the Ohio case data in this issue to JSON format? I would then be able to trigger a retrieval and work on adding the parser.

(opencovidenv) (base) RAHMATHU-M-D0KK:convert-data rahmathu$ python convert_data.py --ncov2019_path=/Users/rahmathu/Documents/personal_projects/open_covid_work --source_id=5f67dc50ace517002f421ebe --outfile=cases.json --sample_rate=.1
usage: convert_data.py [-h] --ncov2019_path NCOV2019_PATH --outfile OUTFILE [--sample_rate SAMPLE_RATE]

Without the --source_id argument:

(opencovidenv) (base) RAHMATHU-M-D0KK:convert-data rahmathu$ python convert_data.py --ncov2019_path=/Users/rahmathu/Documents/personal_projects/open_covid_work --outfile=cases.json --sample_rate=.1
Unzipping /Users/rahmathu/Documents/personal_projects/open_covid_work/latest_data/latestdata.tar.gz
Traceback (most recent call last):
  File "convert_data.py", line 176, in <module>
    main()
  File "convert_data.py", line 39, in main
    csv_path = extract_csv(args.ncov2019_path)
  File "convert_data.py", line 67, in extract_csv
    latest_data_gzip = tarfile.open(gzip_path)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/tarfile.py", line 1599, in open
    return func(name, "r", fileobj, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/tarfile.py", line 1664, in gzopen
    fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/Users/rahmathu/Documents/personal_projects/open_covid_work/latest_data/latestdata.tar.gz'

attwad commented 4 years ago

Sorry, I'm not sure I follow you anymore; why are you manually converting the Ohio case data to JSON? The convert_data.py script you seem to be referring to is used to convert spreadsheet data from an old system; it's not a generic convert-anything script. Please indicate where you're getting the Ohio data from and let's try looking for a structured version of it; I haven't seen any official source so far that wasn't provided in a structured format (i.e. not HTML).

rahul18cracker commented 4 years ago

HI @attwad

I looked at the link in the issue, https://coronavirus.ohio.gov/static/COVIDSummaryData.csv, and on downloading the file I see it's not in the structure/format that Alex highlights in the tutorial videos or that other parsers expect; the data in this CSV file contains different fields. I also see that you mentioned the link seems offline, and I'm not sure what you meant. Did you mean that the link's data is not updated?

In case I need to follow a new link for the Ohio data, please let me know how/where to look for it and I can update the link for this issue. If there is another way to do it, please let me know.

For convert_data.py I had assumed it was a generic converter and we had to use it; sorry, I didn't know it isn't meant for this. Thanks for the clarification.

Thanks ~Rahul

attwad commented 4 years ago

This is what I see on that link (screenshot attached as an image), but it seems to be indexed by Google search properly, so perhaps there's just a (stupid) geographical restriction set up and I'm unable to see the files... I've asked US-based folks to check that link for me.

If all sources were using the same CSV fields we wouldn't really need our project :) This is what the parser should do: convert from "their" fields to ours. The expected format is in the docs: https://github.com/globaldothealth/list/tree/main/ingestion/functions#writing-a-parser
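Schematically, a parser is a field-by-field translation along these lines (a sketch only; the CSV column names and output fields below are placeholders, not the real Ohio columns or the exact Global.health case schema described in those docs):

import csv

def parse_cases(raw_data_file: str):
    # Translate each row from the source's own columns into our case fields.
    # "County" / "Onset Date" and the output keys are illustrative placeholders.
    with open(raw_data_file, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "location": f"{row['County']}, Ohio, United States",
                "confirmed_date": row["Onset Date"],
            }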

rahul18cracker commented 4 years ago

Thanks, I will convert it to that format from the CSV. I think it's location-based filtering that prevents you from seeing that data.

abhidg commented 2 years ago

URL is 404 Dataset, and we have full data from CDC (albeit without location).