vamseeachanta / energydata


BSEE | Data #9

Closed vamseeachanta closed 2 weeks ago

vamseeachanta commented 4 months ago

Objective: Get data from bsee website for further usage (view, process, analysis, etc.)

4: the well data story is done here.

Further data to be read, see below:

VA: Add documentation of all the potential data (previous Google documentation).

vamseeachanta commented 4 months ago

Paleo data: format: fixed width. Read from ASCII files: https://github.com/vamseeachanta/energydata/blob/bseedata/docs/raw_data/paleo
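
A minimal sketch for reading a fixed-width ASCII file with pandas; the file name, column widths, and column names below are placeholder assumptions and would need to match the actual paleo file layout:

```python
# Minimal sketch: read a fixed-width ASCII paleo file with pandas.
# NOTE: file path, column widths, and column names are placeholders;
# adjust them to the real layout of the paleo files.
import pandas as pd

colspecs = [(0, 12), (12, 24), (24, 40)]            # hypothetical character ranges
names = ["api_number", "depth_ft", "paleo_marker"]  # hypothetical column names

df = pd.read_fwf(
    "docs/raw_data/paleo/example_paleo.txt",  # hypothetical file name
    colspecs=colspecs,
    names=names,
)
print(df.head())
```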

vamseeachanta commented 4 months ago

borehole data: api: 608174149400 url: https://www.data.boem.gov/Well/Borehole/Default.aspx/

vamseeachanta commented 4 months ago

This is the place where all the URLs and data are available. See if you can get this script to run and download the data. This is data intensive; I hope you have a good internet connection and hard disk space, otherwise we may have to do it some other way.

https://github.com/vamseeachanta/energydata/blob/7cdc1e006e90967d809d16a0831af7492c34f1f6/src/energydata/custom/bsee_data_refresh.py
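
Because the refresh is data intensive, one way to keep memory use low is to stream each download to disk in chunks. This is a generic sketch, not the logic of bsee_data_refresh.py; the URL and output path are placeholders:

```python
# Sketch: stream a large download to disk in chunks instead of holding it in memory.
# The URL and output path are placeholders, not actual BSEE/BOEM endpoints.
import requests

def download_file(url: str, out_path: str, chunk_size: int = 1 << 20) -> None:
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)

download_file("https://www.data.boem.gov/path/to/large_file.zip", "APIRawData.zip")
```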

samdansk2 commented 4 months ago

borehole data: api: 608174149400 url: https://www.data.boem.gov/Well/Borehole/Default.aspx/

Done sir, successfully ran the test file and got the output, please see below @vamseeachanta

[Screenshot (23)]

vamseeachanta commented 1 month ago

Workflow of usage for BSEE: a/ APILIST: check that the local copy is < 1 week old; otherwise redownload github\energydata\data\bsee\APIRawData.zip

https://github.com/vamseeachanta/assetutilities/blob/0ee779bef997597140c1f51fcbf74f8db27afca1/src/assetutilities/common/data.py
https://github.com/vamseeachanta/energydata/blob/7cdc1e006e90967d809d16a0831af7492c34f1f6/src/energydata/custom/bsee_data_refresh.py

b/ Read the API raw data. Load into an in-line (in-program) database (Python): https://www.sqlalchemy.org/ or inlinesql or other? (A sketch of steps a/ through c/ follows after this list.)

c/ User query: the user will ask for data for a well API14.

d/ Use this API14 to perform other queries (online). The program will get as much other related data as possible.

e/ Last resort: download chesi, break it into individual files, and then run.
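
A rough sketch of steps a/ through c/ under stated assumptions: the zip is assumed to contain a CSV with an "API14" column (the member name, column name, and API14 value are hypothetical), and SQLAlchemy with an in-memory SQLite database is used as one possible "in-line" database:

```python
# Sketch of workflow steps a/ - c/: freshness check, load into an in-memory DB, query by API14.
# Assumptions: APIRawData.zip holds a CSV with an "API14" column; names and paths are hypothetical.
import time
import zipfile
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

ZIP_PATH = Path("data/bsee/APIRawData.zip")
ONE_WEEK_SECONDS = 7 * 24 * 3600

# a/ Freshness check: flag for redownload if the local copy is missing or older than 1 week.
if not ZIP_PATH.exists() or (time.time() - ZIP_PATH.stat().st_mtime) > ONE_WEEK_SECONDS:
    print("APIRawData.zip is stale or missing; redownload it (see bsee_data_refresh.py).")

# b/ Read the raw data from the zip and load it into an in-memory SQLite database.
engine = create_engine("sqlite:///:memory:")
with zipfile.ZipFile(ZIP_PATH) as zf:
    with zf.open("api_raw_data.csv") as f:       # hypothetical member name
        pd.read_csv(f).to_sql("api_raw_data", engine, index=False)

# c/ User query: fetch all rows for a given well API14.
api14 = "60817414940000"                         # hypothetical API14
with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT * FROM api_raw_data WHERE API14 = :api14"), {"api14": api14}
    ).fetchall()
print(rows)
```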

vamseeachanta commented 1 month ago
samdansk2 commented 1 month ago

@vamseeachanta Hello sir, added a PlantUML diagram for the workflow and a test for reading the zip file.

vamseeachanta commented 1 month ago

@samdansk2, this is a generic task. The code and test for reading a zip file should go into AssetUtilities.
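
A sketch of what such a generic utility and its test could look like; the function name, module location, and return type are assumptions, not the actual AssetUtilities API:

```python
# Sketch of a generic zip-reading utility that could live in AssetUtilities,
# plus a pytest-style test. Names and behavior are assumptions, not the existing API.
import zipfile
from pathlib import Path


def read_zip_members(zip_path: str, encoding: str = "utf-8") -> dict[str, str]:
    """Return {member_name: text_content} for every file inside the zip."""
    contents = {}
    with zipfile.ZipFile(Path(zip_path)) as zf:
        for name in zf.namelist():
            if not name.endswith("/"):           # skip directory entries
                contents[name] = zf.read(name).decode(encoding, errors="replace")
    return contents


def test_read_zip_members(tmp_path):
    """Write a small zip, then read it back."""
    zip_path = tmp_path / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("hello.txt", "hello world")
    assert read_zip_members(str(zip_path)) == {"hello.txt": "hello world"}
```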

samdansk2 commented 1 month ago

@vamseeachanta Hello sir, good evening. You asked me to try 10 APIs; when I test those, only one row of data comes out in the output, like the data in the image below. Is that valid for further analysis?

[image]

samdansk2 commented 1 month ago

These are the APIs I have tested:

[Screenshot (35)]

vamseeachanta commented 1 month ago

Looks good. Please add a variety of them as tests in our library.

vamseeachanta commented 1 month ago

e.g. test_bsee_well_data_api_number, or rejig a test file to run through multiple APIs as an array in one test file.

Figure out your architecture, but keep your program as modular as possible, i.e. looping through a list should be on the very outside and we should be able to change that portion easily.
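
One way to keep the looping at the very outside is pytest parametrization; the API numbers and the get_well_data helper below are hypothetical placeholders for the existing library call:

```python
# Sketch: loop over API numbers at the outermost level via pytest parametrization.
# The API numbers and get_well_data() are hypothetical placeholders for the real library call.
import pytest

API_NUMBERS = [
    "608174149400",   # hypothetical test APIs; extend or swap this list freely
    "608174097600",
]


def get_well_data(api_number: str) -> dict:
    """Placeholder for the actual energydata query function."""
    return {"api": api_number, "rows": 1}


@pytest.mark.parametrize("api_number", API_NUMBERS)
def test_bsee_well_data_api_number(api_number):
    data = get_well_data(api_number)
    assert data["rows"] >= 1
```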

samdansk2 commented 1 month ago

@vamseeachanta Hello sir, modified the YAML file to handle 10 APIs; it has produced the data in the output successfully.

vamseeachanta commented 1 month ago

As mentioned, we need to get all data through the URL route first. If the data is not found, we resort to downloaded -> load -> extract the required data using queries.

Downloaded data can go stale easily. Online query data will always be most up to date.
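
A minimal sketch of that order of preference, with the two data-access functions left as hypothetical placeholders:

```python
# Sketch: try the online URL route first; fall back to the downloaded data set only if needed.
# fetch_online() and query_downloaded() are hypothetical placeholders.
from typing import Optional


def fetch_online(api14: str) -> Optional[dict]:
    """Query the BSEE/BOEM website; return None if the data is not found."""
    return None  # placeholder


def query_downloaded(api14: str) -> Optional[dict]:
    """Query the locally downloaded (possibly stale) data set."""
    return {"api14": api14, "source": "downloaded"}  # placeholder


def get_well_data(api14: str) -> Optional[dict]:
    data = fetch_online(api14)          # online data is always the most up to date
    if data is None:
        data = query_downloaded(api14)  # last resort: may be stale
    return data
```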

samdansk2 commented 1 month ago

@vamseeachanta Hello sir, I think this will be useful for us: https://blog.hubspot.com/website/ai-web-scraping#what-is-an-ai-web-scraper

vamseeachanta commented 1 month ago

Query Tool Development: @samdansk2, I read the HubSpot article. It was worth reading to understand what people are doing. There is no guarantee the solution will work for us. Even if it works, we would have to integrate it with our program as open source for anyone to use. Cost, licensing, and too many hurdles to go ahead.

The RealPython scraper example is what I suggested to JC when he was working on this. He should have followed the article below or found a completely different solution in Selenium: https://realpython.com/beautiful-soup-web-scraper-python/ Other related articles: https://www.kdnuggets.com/2023/04/stepbystep-guide-web-scraping-python-beautiful-soup.html
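
In the spirit of the RealPython article, a minimal requests + BeautifulSoup sketch; the URL and the table id are placeholders and would need to be replaced with the real BSEE/BOEM page details:

```python
# Minimal requests + BeautifulSoup sketch in the spirit of the RealPython article.
# The URL and table id are placeholders, not the real BSEE/BOEM page structure.
import requests
from bs4 import BeautifulSoup

url = "https://www.data.boem.gov/Well/Borehole/Default.aspx"
response = requests.get(url, timeout=60)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", id="results")        # hypothetical table id
if table is not None:
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        print(cells)
```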

vamseeachanta commented 1 month ago

Recommend a 2-pronged exploration to obtain a sustainable solution: a/ Data Source: understand how BSEE recommends using their data website, i.e. document workflows (e.g. manual clicking, such as next page with an arrow; reading grids or downloading CSV, etc.). Please see the video tutorials below: https://www.data.bsee.gov/Main/Tutorials/Default.aspx Objective: Is manual querying the best method, or does BSEE have an API?

b/ Query Tool: Understand the Python technologies out there for sending online queries. JC has developed one of the online query methods using Selenium (i.e. manually going through the operations). If the data source is purely manual, we continue using the Selenium solution. Objective: Is manual querying the best method, or does BSEE have a more elegant method? The best method would be a BSEE API; we are essentially building the API ourselves if the BSEE data source does not have one.
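
A sketch of the Selenium-style "manually go through operations" approach (not JC's actual script); the element ids and search value are placeholder assumptions:

```python
# Sketch of a Selenium-driven query (not JC's actual script).
# Element ids and the search value are placeholder assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.data.boem.gov/Well/Borehole/Default.aspx")
    search_box = driver.find_element(By.ID, "api_number_input")   # hypothetical id
    search_box.send_keys("608174149400")
    driver.find_element(By.ID, "search_button").click()           # hypothetical id
    for row in driver.find_elements(By.CSS_SELECTOR, "table tr"):
        print(row.text)
finally:
    driver.quit()
```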

vamseeachanta commented 1 month ago

Understanding the Data Source: When we search for an API, the URL is still neutral, but there must be a sub-URL inside the page that changes and gives the result. Can we find that URL?

[image]
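
One way to hunt for that sub-URL is to open the browser DevTools Network tab while searching, note the request that actually returns the results, and then replay it with requests. The endpoint path and parameter name below are placeholders to be filled in from whatever DevTools shows; they are not known BSEE/BOEM endpoints:

```python
# Sketch: replay the request found in the browser DevTools Network tab.
# The endpoint path and parameter name are placeholders, NOT known BSEE/BOEM endpoints;
# copy them from the actual request seen in DevTools.
import requests

session = requests.Session()
response = session.get(
    "https://www.data.boem.gov/Well/Borehole/SomeResultsEndpoint",  # placeholder sub-URL
    params={"apiNumber": "608174149400"},                           # placeholder parameter name
    timeout=60,
)
response.raise_for_status()
print(response.status_code, len(response.text))
```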

vamseeachanta commented 1 month ago

@samdansk2, please separate this into 2 issues as follows: a/ BSEE Data Source | Spike, b/ Data API | Spike

vamseeachanta commented 1 month ago

Added JC's existing process below: https://github.com/vamseeachanta/energydata/blob/bseedata/docs/bsee_data_api.puml

samdansk2 commented 1 month ago

@vamseeachanta Done sir, added the remaining process.

vamseeachanta commented 3 weeks ago

A brief data status update: we have written Python code to get the following data by Well API12 Number.

A brief overview is also given in link below:

samdansk2 commented 2 weeks ago

Old issue about the scraping process; not necessary for now.