Closed by rahulbot 8 months ago
An example metadata record from a WARC file (no text_extraction
value in any of 451865 WARC records examined).
{
"rss_entry": {
"link": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
"title": "The Basic Principles Of would or could",
"domain": "ttblogs.com",
"pub_date": "Mon, 19 Feb 2024 04:51:05 -0000",
"fetch_date": "2024-02-18"
},
"http_metadata": {
"response_code": 200,
"fetch_timestamp": 1708387428.433925,
"final_url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
"encoding": "utf-8"
},
"content_metadata": {
"original_url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
"url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
"normalized_url": "http://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
"canonical_domain": "ttblogs.com",
"publication_date": "2024-02-19",
"language": "en",
"full_language": "en",
"text_extraction_method": "trafilatura",
"article_title": "The Basic Principles Of would or could",
"normalized_article_title": "the basic principles of would or could",
"text_content": "The Basic Principles Of would or could\nThe Basic Principles Of would or could\nBlog Article\nReaction 29: \u201cI understand that this may possibly demand a lots of effort and hard work and attention on your part, and I would like to supply my support and understanding while you tackle it.\u201d\nLeaves are literally regarded as plant organs as they are a group of tissues which perform the frequent capabilities of photosynthesis, gaseous exchange and transport.\nAlso because you use \u2018will\u2019 previously from the sentence using \u2018can\u2019 sounds superior because that way you are not mixing tenses (i.e. can \u2013 will and could \u2013 would, typically go alongside one another in a similar sentence or phrase). I hope this response aids slightly.\nJanuary Memes - We've been sharing essentially the most humorous and memes to get started on the main thirty day period in the year! From it being the longest month at any time to popular culture tendencies, enjoy & share.\nIn currently\u2019s professional planet, successful interaction is essential for constructing thriving relationships each inside of and outside the office.\nWe use would to seek advice from common habitual steps and activities before. This is usually a formal use and it frequently happens in stories (narratives):\nTo conclude, I like to recommend you utilize \u201cmongooses\u201d given that the plural form of \u201cmongoose.\u201d It is apparently the right word. I\u2019ll make sure you Permit all the business at my evening meal desk know, and you may enable me distribute the term far too.\nRight now we are going to dive in to* the difference between \u2018could\u2019 and \u2018would\u2019 and when to rely on them.\nNevertheless, if the context of your sentence focuses over the people inside the group, you need to take care of it as plural. 
By way of example: The pod is moving nearer.\nWith the Cambridge English Corpus Potential investigations will also be needed to analyze the host preference from the vector and its feasible Affiliation click here with the mongoose\nRelative clauses Relative clauses referring to an entire sentence Relative clauses: defining and non-defining Relative clauses: common errors\nHuman being A: \"I hate The actual fact that Jeremy slept with Stevie, like what the actual fuck is Improper with him?\"\nReaction 24: \u201cI take pleasure in that you\u2019re the qualified in this location, and I am aware you\u2019ll manage it with your regular skill and professionalism.\u201d\nMost on line reference entries and content articles would not have web page numbers. Consequently, that information is unavailable for the majority of Encyclopedia.com information. Having said that, the date of retrieval is frequently essential. Seek advice from Each and every style\u2019s Conference relating to the best way to format page numbers and retrieval dates.",
"is_homepage": false,
"is_shortened": false,
"parsed_date": "2024-02-20T00:06:10.035042"
}
}
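The observation that no record carried a text_extraction value can be checked with a small scan over already-parsed metadata records. The sketch below is not the script that was actually run; the helper names (has_key, count_records_with_key) and the two-record sample are hypothetical, and reading the WARC files themselves is left out.

```python
from typing import Any, Iterable


def has_key(obj: Any, key: str) -> bool:
    """Return True if `key` appears anywhere in a nested dict/list structure."""
    if isinstance(obj, dict):
        return key in obj or any(has_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(has_key(v, key) for v in obj)
    return False


def count_records_with_key(records: Iterable[dict], key: str) -> int:
    """Count how many metadata records contain `key` at any nesting depth."""
    return sum(1 for record in records if has_key(record, key))


# Hypothetical stand-in for metadata dicts already decoded from WARC records.
sample = [
    {"content_metadata": {"text_extraction_method": "trafilatura"}},
    {"content_metadata": {"language": "en"}},
]

print(count_records_with_key(sample, "text_extraction"))
print(count_records_with_key(sample, "text_extraction_method"))
```

Note that the lookup matches exact key names, so text_extraction_method does not count as text_extraction.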
Without knowing much about researchers/front-end needs:
normalized_article_title and normalized_url can be retrieved/recreated from the other fields. My vote is to NOT store them in ES.
Not sure what the text_extraction field is for. Any idea, @thepsalmist?
text_extraction_method field: trafilatura.
Having looked at all the mappings, we're not using text_extraction; this must have come from the export of the initial ES schema.
I'm also in agreement with the rationale for removing the other fields highlighted above.
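To illustrate the point that normalized_article_title and normalized_url can be recreated from other fields, here is a minimal sketch. Both functions are simplified, hypothetical stand-ins: the real normalization used by the pipeline is not shown in this thread and very likely does more (for example, stripping tracking parameters). The sketch only reproduces the values seen in the example record.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def sketch_normalize_title(title: str) -> str:
    """Assumed normalization: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", title).strip().lower()


def sketch_normalize_url(url: str) -> str:
    """Assumed normalization: force http scheme, lowercase host, drop fragment."""
    parts = urlsplit(url)
    return urlunsplit(("http", parts.netloc.lower(), parts.path, parts.query, ""))


# Values copied from the example record.
article_title = "The Basic Principles Of would or could"
url = "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could"

print(sketch_normalize_title(article_title))
print(sketch_normalize_url(url))
```

Because both outputs are pure functions of article_title and url, nothing is lost by computing them on demand instead of storing them.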
I think we have agreement to delete these fields from the index: normalized_article_title, normalized_url, text_extraction_method. Editing title and assigning accordingly.
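One way the agreed removal could look on the ingest side is to filter the content_metadata dict before it is sent to ES. This is only a sketch: FIELDS_TO_DROP and strip_dropped_fields are hypothetical names, and the actual change may instead be made in the index template or the importer.

```python
# Fields the thread agreed to stop storing in the index.
FIELDS_TO_DROP = frozenset(
    {"normalized_article_title", "normalized_url", "text_extraction_method"}
)


def strip_dropped_fields(content_metadata: dict) -> dict:
    """Return a copy of the metadata without the fields slated for removal."""
    return {k: v for k, v in content_metadata.items() if k not in FIELDS_TO_DROP}


# Abbreviated version of the example record's content_metadata.
doc = {
    "url": "https://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "normalized_url": "http://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could",
    "article_title": "The Basic Principles Of would or could",
    "normalized_article_title": "the basic principles of would or could",
    "text_extraction_method": "trafilatura",
    "language": "en",
}

print(sorted(strip_dropped_fields(doc)))
```

Returning a copy rather than mutating the input keeps the full record available to earlier pipeline stages that may still need normalized_url.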
While working on #229 I revisited the ES index schema, which I believe is at conf/elasticsearch/templates/create_index_template.json. If that is accurate, I have some questions about the fields:
normalized_article_title: I think this is used in rss-fetcher for deduplication. Do we use it in story-indexer for the same purpose? If not, why are we storing it? It is not useful to users (it is "lossy").
normalized_url: This is used to compute the hash_id, but is it useful for any other reason? Any reason we are storing it? Users don't search by this and it is similarly "lossy".
text_extraction: What does this field hold? I can't tell from a quick read of the code.
text_extraction_method: This holds the name of the library we used to extract the text from HTML. Is this useful enough to researchers to justify storing it?
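Since normalized_url is described as the input to hash_id, the relationship can be sketched as below. The thread does not say which hash function is used, so SHA-256 and the name sketch_hash_id are assumptions for illustration only.

```python
import hashlib


def sketch_hash_id(normalized_url: str) -> str:
    """Assumed scheme: hex SHA-256 digest of the normalized URL (not confirmed)."""
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()


# normalized_url value copied from the example record.
normalized = "http://damienafsoe.ttblogs.com/4775282/the-basic-principles-of-would-or-could"
doc_id = sketch_hash_id(normalized)
print(len(doc_id))
```

Under this assumption the normalized URL is only needed transiently at ingest time to derive the ID, which is consistent with the question of whether it has to be stored at all.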