shanyachaubey / Whats_Happenin-A_big_data_project


[Mongo] Processed Data contains duplicates for key: "top_24_by_topics". #39

Closed: codingminions closed this issue 6 months ago

codingminions commented 6 months ago

Steps to Reproduce:

  1. Start your local MongoDB server with db: 'userquery' and collection: 'sessions'.
  2. Fetch 500 articles for a particular location and date range from the Newscatcher API. (I have attached a file with article data for Los Angeles, California between 03/02/2024 and 03/20/2024.)
  3. Create an entry in MongoDB against the db and collection mentioned above.
  4. Execute the FastAPI command to send a PUT request to the "/articles" endpoint with the id of the entry made above (see the reproduction sketch after this list).
  5. Once the API returns a 200 OK response, open the MongoDB Compass application and use the GUI to view the documents stored in 'userquery -> sessions'.
  6. Go to the key "top_24_by_topics" of the entry made above and click on a topic with more than 7-8 articles.
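
A minimal reproduction sketch of the steps above, assuming a local MongoDB on the default port, the FastAPI app running on http://localhost:8000, the attached articles saved as "articles_la.json", and the session document shape and the "id" query parameter of the PUT request. All of those names are assumptions; the actual app may expect a different request or document layout.

```python
import json

import requests
from pymongo import MongoClient

# Steps 1-3: insert the raw articles into userquery.sessions.
client = MongoClient("mongodb://localhost:27017")
sessions = client["userquery"]["sessions"]

with open("articles_la.json") as f:  # hypothetical filename for the attached data
    articles = json.load(f)

entry_id = sessions.insert_one({"articles": articles}).inserted_id  # assumed document shape

# Step 4: PUT request against the "/articles" endpoint with the id of the new entry.
resp = requests.put("http://localhost:8000/articles", params={"id": str(entry_id)})
print(resp.status_code)  # expect 200 OK

# Steps 5-6: inspect the processed document and its per-topic index lists.
doc = sessions.find_one({"_id": entry_id})
for topic, indices in doc.get("top_24_by_topics", {}).items():
    print(topic, indices)
```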

Output (MongoDB Compass view): top_24_by_topics (Object) -> Sports (Array, 4 elements): 57, 57, 57, 57

(screenshot: processMongoOutput)

Expected Output: As you can see above, the indices are repeated (57 appears four times). The list of article indices for each topic should contain only unique indices, ensuring that only unique articles are sent to the frontend interface.
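
A minimal sketch of the expected behaviour: each topic's index list should be deduplicated before being written back, e.g. with an order-preserving pass. The document shape ({"top_24_by_topics": {topic: [indices]}}) is taken from the output above; where exactly this would run in the pipeline is an assumption.

```python
def dedupe_topic_indices(top_24_by_topics: dict[str, list[int]]) -> dict[str, list[int]]:
    # dict.fromkeys keeps the first occurrence of each index and preserves order.
    return {topic: list(dict.fromkeys(indices))
            for topic, indices in top_24_by_topics.items()}

print(dedupe_topic_indices({"Sports": [57, 57, 57, 57]}))  # {'Sports': [57]}
```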

The "/articles" endpoint might be making some assumptions about the kind of data fetched from the newscatcher API. Going through the raw articles data (attached file), I do see some fields which don't have consistent value present across all articles fetched. Needs to be fixed from the data pipeline end.

MongoDB document entry text: rawMongoDocumentEntry.txt

codingminions commented 6 months ago

The issue was not related to the FastAPI pipeline but to the way we were fetching articles from the Newscatcher API in the backend code. That is fixed, and we no longer see duplicates in the top_24_by_cat key. Closing this issue.
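
A sketch of the kind of fetch-side fix described here: deduplicate articles as they come back from the Newscatcher API, keyed on a stable field such as the article link. The helper name, the "link" key, and the paginated list-of-pages input are assumptions; the actual backend change may differ.

```python
def dedupe_articles(pages: list[list[dict]]) -> list[dict]:
    seen: set[str] = set()
    unique: list[dict] = []
    for page in pages:            # pages of results returned across paginated API calls
        for article in page:
            key = article.get("link")
            if key and key not in seen:
                seen.add(key)
                unique.append(article)
    return unique
```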