delete unnecessary ES index fields

rahulbot commented 8 months ago

While working on #229 I revisited the ES index schema, which I believe is at conf/elasticsearch/templates/create_index_template.json. If that is accurate, I have some questions about the fields:

normalized_article_title: I think this is used in rss-fetcher for deduplication. Do we use it in story-indexer for the same? If not, why are we storing it? It is not useful to users (it is "lossy").
normalized_url: This is used to compute the hash _id, but is it useful for any other reasons? Any reason we are storing it? Users don't search by this and it is similarly "lossy".
text_extraction: What does this field hold? I can't tell from a quick read of the code.
text_extraction_method: This holds the name of the library we used to extract the text from HTML. Is it useful enough to researchers to justify storing it?
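
For context, the dependency between normalized_url and the hash _id could look roughly like the sketch below. This is only an illustration, not the story-indexer code: the function name `url_hash_id` and the choice of SHA-256 are assumptions.

```python
import hashlib


def url_hash_id(normalized_url: str) -> str:
    """Hypothetical helper: derive a document _id by hashing the normalized URL.

    Assumption: SHA-256 hex digest; the real pipeline may use a different hash.
    """
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()
```

If the _id is derived this way at import time, the normalized URL only needs to exist in memory while the hash is computed, not as a stored field.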

philbudne commented 8 months ago

An example metadata record from a WARC file (no text_extraction value in any of 451865 WARC records examined).

{
  "rss_entry": {
    "link": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "title": "The Basic Principles Of would or could",
    "domain": "ttblogs.com",
    "pub_date": "Mon, 19 Feb 2024 04:51:05 -0000",
    "fetch_date": "2024-02-18"
  },
  "http_metadata": {
    "response_code": 200,
    "fetch_timestamp": 1708387428.433925,
    "final_url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "encoding": "utf-8"
  },
  "content_metadata": {
    "original_url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "normalized_url": "http://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "canonical_domain": "ttblogs.com",
    "publication_date": "2024-02-19",
    "language": "en",
    "full_language": "en",
    "text_extraction_method": "trafilatura",
    "article_title": "The Basic Principles Of would or could",
    "normalized_article_title": "the basic principles of would or could",
    "text_content": "The Basic Principles Of would or could\nThe Basic Principles Of would or could\nBlog Article\nReaction 29: \u201cI understand that this may possibly demand a lots of effort and hard work and attention on your part, and I would like to supply my support and understanding while you tackle it.\u201d\nLeaves are literally regarded as plant organs as they are a group of tissues which perform the frequent capabilities of photosynthesis, gaseous exchange and transport.\nAlso because you use \u2018will\u2019 previously from the sentence using \u2018can\u2019 sounds superior because that way you are not mixing tenses (i.e. can \u2013 will and could \u2013 would, typically go alongside one another in a similar sentence or phrase). I hope this response aids slightly.\nJanuary Memes - We've been sharing essentially the most humorous and memes to get started on the main thirty day period in the year! From it being the longest month at any time to popular culture tendencies, enjoy & share.\nIn currently\u2019s professional planet, successful interaction is essential for constructing thriving relationships each inside of and outside the office.\nWe use would to seek advice from common habitual steps and activities before. This is usually a formal use and it frequently happens in stories (narratives):\nTo conclude, I like to recommend you utilize \u201cmongooses\u201d given that the plural form of \u201cmongoose.\u201d It is apparently the right word. I\u2019ll make sure you Permit all the business at my evening meal desk know, and you may enable me distribute the term far too.\nRight now we are going to dive in to* the difference between \u2018could\u2019 and \u2018would\u2019 and when to rely on them.\nNevertheless, if the context of your sentence focuses over the people inside the group, you need to take care of it as plural. By way of example: The pod is moving nearer.\nWith the Cambridge English Corpus Potential investigations will also be needed to analyze the host preference from the vector and its feasible Affiliation click here with the mongoose\nRelative clauses Relative clauses referring to an entire sentence Relative clauses: defining and non-defining Relative clauses: common errors\nHuman being A: \"I hate The actual fact that Jeremy slept with Stevie, like what the actual fuck is Improper with him?\"\nReaction 24: \u201cI take pleasure in that you\u2019re the qualified in this location, and I am aware you\u2019ll manage it with your regular skill and professionalism.\u201d\nMost on line reference entries and content articles would not have web page numbers. Consequently, that information is unavailable for the majority of Encyclopedia.com information. Having said that, the date of retrieval is frequently essential. Seek advice from Each and every style\u2019s Conference relating to the best way to format page numbers and retrieval dates.",
    "is_homepage": false,
    "is_shortened": false,
    "parsed_date": "2024-02-20T00:06:10.035042"
  }
}
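
A check like the one described above (counting how many records carry a given field) could be sketched as follows. It assumes the metadata records have already been parsed into dicts shaped like this example; the function name `count_with_field` is hypothetical, and reading the WARC files themselves is left out.

```python
from typing import Iterable


def count_with_field(
    records: Iterable[dict], field: str, section: str = "content_metadata"
) -> int:
    """Count records whose `section` dict has a non-null value for `field`."""
    return sum(
        1
        for record in records
        # `or {}` guards against a section that is missing or explicitly null
        if (record.get(section) or {}).get(field) is not None
    )
```

For the record above, `count_with_field([record], "text_extraction")` would return 0, while the same call with `"text_extraction_method"` would return 1.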

kilemensi commented 8 months ago

Without knowing much about researchers/front-end needs:

Anything in normalized_article_title and normalized_url can be retrieved/recreated from the other fields. My vote is to NOT store them in ES.
No clue what text_extraction field is for. Any idea @thepsalmist ?
As is, my vote would also be to NOT store the text_extraction_method field:
1. Just knowing the library without knowing the version, etc., will not be enough to support any future comparison, should we need to compare different libraries or extraction methods (e.g. a local LLM).
2. Should the need to store this field arise (e.g. the day we change the library), we can safely (accurately?) assume that anything before that used the current library, i.e. trafilatura.
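
To illustrate the "can be recreated" point, here is a rough sketch that rebuilds both normalized values from fields that would remain in the index. The two functions are hypothetical stand-ins written to match the example record in this thread; the real normalization rules are likely more involved.

```python
from urllib.parse import urlsplit, urlunsplit


def sketch_normalize_title(article_title: str) -> str:
    """Assumed rule: lowercase the title and collapse runs of whitespace."""
    return " ".join(article_title.lower().split())


def sketch_normalize_url(url: str) -> str:
    """Assumed rule: force the http scheme, lowercase the host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(("http", parts.netloc.lower(), parts.path, parts.query, ""))
```

Applied to the example record's article_title and url, these reproduce the stored normalized_article_title and normalized_url values.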

thepsalmist commented 8 months ago

Having looked at all the mappings, we're not using text_extraction; it must have come from the export of the initial ES schema. I also agree with the rationale for removing the other fields highlighted above.

rahulbot commented 8 months ago

I think we have agreement to delete these fields from the index: normalized_article_title, normalized_url, text_extraction_method. Editing title and assigning accordingly.
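
As a starting point for whoever picks this up, the change to the mapping could be sketched like this. It assumes the field definitions live in a "properties" dict inside the index template; the helper name `drop_fields` is hypothetical, and the exact nesting in create_index_template.json is not shown here.

```python
import copy

# The three fields agreed on in this thread
FIELDS_TO_DROP = (
    "normalized_article_title",
    "normalized_url",
    "text_extraction_method",
)


def drop_fields(properties: dict, fields=FIELDS_TO_DROP) -> dict:
    """Return a copy of a mapping's "properties" dict without the named fields."""
    pruned = copy.deepcopy(properties)  # leave the caller's dict untouched
    for name in fields:
        pruned.pop(name, None)  # ignore fields that are already absent
    return pruned
```

Calling `drop_fields` on the template's properties and writing the result back would remove the three fields from newly created indices; documents being imported would also need to stop sending them.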

mediacloud / story-indexer

delete unnecessary ES index fields #243