vfedotovs / sslv_web_scraper

The ss.lv web scraping app automates scraping and filtering of information from classifieds, emails the results, and stores the scraped data in a database.
GNU General Public License v3.0
5 stars, 3 forks

BUG: Links missing in email body #230

Closed: vfedotovs closed this issue 8 months ago

vfedotovs commented 9 months ago

1 room apartment segment:
[Rooms, Floor, Size , Price, SQM Price, Apartment Street, Pub_date, URL] <<<
2 room apartment segment:
[Rooms, Floor, Size , Price, SQM Price, Apartment Street, Pub_date, URL]
3 room apartment segment: <<<
[Rooms, Floor, Size , Price, SQM Price, Apartment Street, Pub_date, URL]
4 room apartment segment: <<<
[Rooms, Floor, Size , Price, SQM Price, Apartment Street, Pub_date, URL] <<<

Each section should look like this:
[Rooms, Floor, Size , Price, SQM Price, Apartment Street, Pub_date, URL]
1 2/2 26.0 13000 500.0 Tinužu 8 25.11.2023 https://ss.lv/msg/lv/real-estate/flats/ogre-and-reg/ogre/adbfh.html
--- cut ---
1 7/9 43.0 49500 1151.16 Tīnūžu 7 02.12.2023 https://ss.lv/msg/lv/real-estate/flats/ogre-and-reg/ogre/dicjf.html
2 room apartment segment:

vfedotovs commented 8 months ago

The issue is in df_cleaner.py, function create_email_body, line 218:
217 email_body_txt.append(section_line)
218 filtered_by_room_count = clean_data_frame.loc[clean_data_frame['Room_count'] == str(room_count_str)] <<<

(Pdb) p filtered_by_room_count
Empty DataFrame
Columns: [Unnamed: 0, URL, Room_count, Floor, Street, Pub_date, Size_sqm, Price_in_eur, SQ_meter_price]
Index: [] <<<

The bug was introduced after the special-case handling of 'citi' was added. Previously the DataFrame Room_count column had dtype object; now that the column contains only integers, its dtype is int64, so the comparison against str(room_count_str) on line 218 matches no rows.
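The dtype mismatch can be reproduced in isolation. This is a minimal sketch, assuming only that Room_count is a pandas column; the sample values and the variable names with_citi and ints_only are made up for illustration and do not come from the scraper's data.

```python
import pandas as pd

# Illustrative data only: a 'citi' entry forces the column to hold strings.
with_citi = pd.DataFrame({"Room_count": ["1", "2", "citi"]})
# With only numeric values, pandas infers an integer dtype instead.
ints_only = pd.DataFrame({"Room_count": [1, 2, 3]})

room_count_str = "1"

# Same comparison as on line 218: column values against a string.
matched_obj = with_citi.loc[with_citi["Room_count"] == str(room_count_str)]
matched_int = ints_only.loc[ints_only["Room_count"] == str(room_count_str)]

# Print each column's dtype and how many rows the filter selected.
print(with_citi["Room_count"].dtype, len(matched_obj))
print(ints_only["Room_count"].dtype, len(matched_int))
```

Comparing an integer column to a string raises no error; it returns False for every row, which is why the filter silently yields an empty DataFrame.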

The fix is to handle both cases: Room_count stored as strings (dtype object) and as integers (dtype int64).
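One possible way to handle both cases is to normalize the column to strings before comparing. This is a sketch only: the helper name filter_by_room_count is hypothetical, and it is not necessarily the change that was committed to the repository.

```python
import pandas as pd


def filter_by_room_count(clean_data_frame: pd.DataFrame, room_count) -> pd.DataFrame:
    """Hypothetical helper: select rows by room count for either column dtype."""
    # Cast the column to str so integer and string columns compare the same way.
    room_counts = clean_data_frame["Room_count"].astype(str)
    return clean_data_frame.loc[room_counts == str(room_count)]


# Illustrative data only, covering both dtypes described in the issue.
int_frame = pd.DataFrame({"Room_count": [1, 2, 1], "Floor": ["2/2", "3/5", "7/9"]})
obj_frame = pd.DataFrame({"Room_count": ["1", "citi"], "Floor": ["1/5", "4/9"]})

print(filter_by_room_count(int_frame, "1"))
print(filter_by_room_count(obj_frame, 1))
```

Casting to str keeps the 'citi' special case working, but it assumes integer-like values; a float column would render 1.0 as "1.0" and would need separate handling.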

vfedotovs commented 8 months ago

Issue is resolved in ad0b399