vfedotovs / sslv_web_scraper

An ss.lv web scraping app that automates scraping and filtering of classified ads, emails the results, and stores the scraped data in a database
GNU General Public License v3.0

BUG(ws): All arrays must be of the same length #286

Closed: vfedotovs closed this issue 3 months ago

vfedotovs commented 3 months ago

Detected in:

dup_db_issue_2024_08_11]$ ls -lr /tmp
122712 Aug 11 08:05 pg_backup_2024_08_11.sql << dup 4
122712 Aug 10 08:05 pg_backup_2024_08_10.sql << dup 3
122712 Aug  9 08:05 pg_backup_2024_08_09.sql << dup 2
122712 Aug  8 08:05 pg_backup_2024_08_08.sql - ok ???
122689 Aug  7 08:05 pg_backup_2024_08_07.sql

Error from the WS log files:

2024-08-11 00:54:53,061 [MainThread  ] [INFO ] web_scraper : extract_data_from_url: 98: Extracting data from message URL  61
2024-08-11 00:55:00,057 [MainThread  ] [INFO ] web_scraper : extract_data_from_url: 98: Extracting data from message URL  62
2024-08-11 00:55:05,522 [MainThread  ] [INFO ] web_scraper : scrape_website: 84: Creating file Ogre-raw-data-report.txt copy in data folder
2024-08-11 00:55:05,604 [MainThread  ] [INFO ] web_scraper : scrape_website: 86: --- Finished web_scraper module ---
---- 

2024-08-11 00:55:05,611 [MainThread  ] [INFO ] : run_long_task: 100: Running data_formater_main task: using locally scraped file
2024-08-11 00:55:05,612 [MainThread  ] [INFO ] : cloud_data_formater_main: 114:  --- Started data_format_changer module ---
2024-08-11 00:55:05,613 [MainThread  ] [INFO ] : cloud_data_formater_main: 116: AWS lambda scraped file path: local_lambda_raw_scraped_data/Ogre-raw-data-report-2024-08-11.txt
2024-08-11 00:55:05,614 [MainThread  ] [INFO ] : check_todays_cloud_data_file_exist: 84: Searching for cloud files with todays date: 2024-08-11

--- cut --- 
Current path: /
2024-08-11 00:55:05,614 [MainThread  ] [INFO ] : check_todays_cloud_data_file_exist: 93: File Ogre-raw-data-report-2024-08-11.txt containing today date 2024-08-11 found,
2024-08-11 00:55:05,615 [MainThread  ] [INFO ] : cloud_data_formater_main: 119: AWS lambda scraped file exists: True
2024-08-11 00:55:05,615 [MainThread  ] [INFO ] : cloud_data_formater_main: 122: Creating one-line report using lambda scraped file: local_lambda_raw_scraped_data/Ogre-raw-data-report-2024-08-11.txt
2024-08-11 00:55:05,668 [MainThread  ] [ERROR] : create_oneline_report: 231: An error occurred while processing the file local_lambda_raw_scraped_data/Ogre-raw-data-report-2024-08-11.txt : All arrays must be of the same length  <<<<

INFO:     172.26.0.4:56044 - "GET /run-task/ogre HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
-- cut -- 
  File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/main.py", line 101, in run_long_task
    cloud_data_formater_main()
  File "/app/wsmodules/data_format_changer.py", line 126, in cloud_data_formater_main
    ogre_city_data_frame.to_csv("pandas_df.csv")
AttributeError: 'NoneType' object has no attribute 'to_csv'   <<<<<
vfedotovs commented 3 months ago

Inconsistent data lengths: The error All arrays must be of the same length is raised by pandas when the lists or arrays passed to the DataFrame constructor have different lengths. Judging by the log, create_oneline_report catches this error and only logs it, so it most likely returns None instead of a DataFrame.

AttributeError on to_csv: The error AttributeError: 'NoneType' object has no attribute 'to_csv' means that ogre_city_data_frame was None when cloud_data_formater_main called to_csv on it (data_format_changer.py, line 126). This is a direct consequence of the failed DataFrame creation, not a separate bug.
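
The failure chain described here can be reproduced in a few lines. This is a minimal sketch, not the project's code: create_oneline_report_sketch is a hypothetical stand-in that assumes the real create_oneline_report catches the pandas error, logs it, and returns None, which is what the log output suggests. The column names and values are made up for illustration.

```python
import pandas as pd


def create_oneline_report_sketch(columns):
    """Hypothetical stand-in: build a DataFrame, log and return None on failure."""
    try:
        return pd.DataFrame(columns)
    except ValueError as exc:
        # pandas raises ValueError when the column lists differ in length.
        print(f"An error occurred while processing the file: {exc}")
        return None


# One more URL than Date entries, mirroring the failing reports (made-up values).
columns = {"Date": ["2024-08-10"], "URL": ["https://a", "https://b"]}
frame = create_oneline_report_sketch(columns)

# Guarding against None avoids the follow-up AttributeError on to_csv.
if frame is None:
    print("No DataFrame was built, skipping to_csv")
else:
    frame.to_csv("pandas_df.csv")
```

With the guard in place the caller can report the real problem (the malformed input file) instead of failing later with an unrelated-looking AttributeError.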

Proposed fix

Check data lengths before creating the DataFrame: ensure that all the arrays or lists used to build the DataFrame have the same length.
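
A length check of this kind could look roughly like the sketch below. It is only an illustration: find_length_mismatch and the sample column names are hypothetical, and the actual fix in the repository may work differently.

```python
def find_length_mismatch(columns):
    """Return a {name: length} dict if column lengths differ, else an empty dict."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        return lengths
    return {}


# Made-up columns with one URL too many.
columns = {"Date": ["d1", "d2"], "URL": ["u1", "u2", "u3"]}
mismatch = find_length_mismatch(columns)
if mismatch:
    print(f"Column lengths differ: {mismatch}")
```

Returning the per-column lengths rather than a plain boolean makes the log message show which attribute is short or long, which is the information needed to find the malformed ad entry.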

Root cause: the DataFrame creation call failed because the attribute count is not consistent across ad entries. Counts of Date and https entries per raw report:

Ogre-raw-data-report-2024-08-07.txt - consistent attribute count
  96 Date
  96 https
Ogre-raw-data-report-2024-08-08.txt - consistent attribute count
  61 Date
  61 https
Ogre-raw-data-report-2024-08-10.txt - fails 
  81 Date <<<  for each ad attribute count is NOT consistent
  82 https <<< 
Ogre-raw-data-report-2024-08-11.txt - fails 
 143 Date <<< for each ad attribute count is NOT consistent
 144 https <<<
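
The same per-file comparison can be scripted so a malformed report is caught before formatting starts. The sketch below is an assumption-heavy illustration: count_markers is a hypothetical helper, and the sample text only guesses at the raw report layout (one https line and one Date line per ad), which the issue does not show.

```python
def count_markers(text, markers=("Date", "https")):
    """Count the lines of text that contain each marker string."""
    lines = text.splitlines()
    return {m: sum(1 for line in lines if m in line) for m in markers}


# Made-up sample: two ads, but the second one is missing its Date line.
sample = "\n".join([
    "https://example.invalid/ad/1",
    "Date: 10.08.2024",
    "https://example.invalid/ad/2",
])
counts = count_markers(sample)
consistent = len(set(counts.values())) == 1
print(counts, "consistent" if consistent else "NOT consistent")
```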
vfedotovs commented 3 months ago

Issue has been resolved in 8b6aa3d