opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] "Did you mean" TermSuggester does not work as expected in multi-shard setup #8091

Open danithaca opened 1 year ago

danithaca commented 1 year ago

Describe the bug

We have a multi-shard setup and want to support query spell correction using the "Did you mean" TermSuggester, as described at https://opensearch.org/docs/latest/search-plugins/searching-data/did-you-mean/#term-suggester. However, it sometimes does not work as expected. For example, when we search for actor (a term that appears multiple times in the corpus), it is incorrectly corrected to after, action, and altos (see screenshot below).

To Reproduce

Step 1: Create a simple index, with 8 shards

PUT /term-suggester-debug
{
  "mappings": {
    "properties": {
      "document": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "8"
    }
  }
}

Step 2: Ingest 5 documents. Note that actor appears multiple times in the corpus.

PUT /term-suggester-debug/_doc/obama
{
  "document": "Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/ (listen) bə-RAHK hoo-SAYN oh-BAH-mə;[1] born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. He was the first African-American president of the United States.[2] A member of the Democratic Party, he previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004. Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. Obama received national attention in 2004 with his March Senate primary win, his well-received July Democratic National Convention keynote address, and his landslide November election to the Senate. In 2008, a year after beginning his campaign, and after a close primary campaign against Hillary Clinton, he was nominated by the Democratic Party for president. Obama was elected over Republican nominee John McCain in the general election and was inaugurated alongside his running mate Joe Biden, on January 20, 2009. Nine months later, he was named the 2009 Nobel Peace Prize laureate, a decision that drew a mixture of praise and criticism. Obama signed many landmark bills into law during his first two years in office. The main reforms include: the Affordable Care Act (ACA or \"Obamacare\"), although without a public health insurance option; the Dodd–Frank Wall Street Reform and Consumer Protection Act. 
The American Recovery and Reinvestment Act and Tax Relief, Unemployment Insurance Reauthorization, and Job Creation Act served as economic stimuli amidst the Great Recession. After a lengthy debate over the national debt limit, he signed the Budget Control and the American Taxpayer Relief Acts. In foreign policy, he increased U.S. troop levels in Afghanistan, reduced nuclear weapons with the United States–Russia New START treaty, and ended military involvement in the Iraq War. In 2011, Obama ordered the drone-strike killing of Anwar al-Awlaki a US citizen and suspected al-Qaeda operative, leading to controversy. He ordered military involvement in Libya for the implementation of the UN Security Council Resolution 1973, contributing to the overthrow of Muammar Gaddafi. He also ordered the military operation that resulted in the death of Osama bin Laden. After winning re-election by defeating Republican opponent Mitt Romney, Obama was sworn in for a second term on January 20, 2013. During this term, he promoted inclusion for LGBT Americans. His administration filed briefs that urged the Supreme Court to strike down same-sex marriage bans as unconstitutional (United States v. Windsor and Obergefell v. Hodges); same-sex marriage was legalized nationwide in 2015 after the Court ruled so in Obergefell. He advocated for gun control in response to the Sandy Hook Elementary School shooting, indicating support for a ban on assault weapons, and issued wide-ranging executive actions concerning global warming and immigration. In foreign policy, he ordered military interventions in Iraq and Syria in response to gains made by ISIL after the 2011 withdrawal from Iraq, promoted discussions that led to the 2015 Paris Agreement on global climate change, oversaw the deadly Kunduz hospital airstrike, drew down U.S. troops in Afghanistan in 2016, initiated sanctions against Russia following the Annexation of Crimea and again after interference in the 2016 U.S. 
elections, brokered the Joint Comprehensive Plan of Action nuclear deal with Iran, and normalized U.S. relations with Cuba. Obama nominated three justices to the Supreme Court: Sonia Sotomayor and Elena Kagan were confirmed as justices, while Merrick Garland was denied hearings or a vote from the Republican-majority Senate. Obama left office on January 20, 2017, and continues to reside in Washington, D.C. During Obama's terms as president, the United States' reputation abroad, as well as the American economy, significantly improved. Scholars and historians rank him among the upper to mid tier of American presidents. Since leaving office, Obama has remained active in Democratic politics, including campaigning for candidates in the 2018 midterm elections, appearing at the 2020 Democratic National Convention and campaigning for Biden during the 2020 presidential election. Outside of politics, Obama has published three bestselling books: Dreams from My Father (1995), The Audacity of Hope (2006) and A Promised Land (2020). "
}

PUT /term-suggester-debug/_doc/ronaldo
{
  "document": "Cristiano Ronaldo dos Santos Aveiro GOIH ComM (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaɫdu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Premier League club Manchester United and captains the Portugal national team. Often considered the best player in the world and widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards[note 3] and four European Golden Shoes, the most by a European player. He has won 32 trophies in his career, including seven league titles, five UEFA Champions Leagues, one UEFA European Championship, and one UEFA Nations League. Ronaldo holds the records for most appearances (183), most goals (140), and assists (42) in the Champions League, most goals in the European Championship (14), most international goals by a male player (117), and most international appearances by a European male (189). He is one of the few players to have made over 1,100 professional career appearances, and has scored over 800 official senior career goals for club and country.Ronaldo began his senior career with Sporting CP, before signing with Manchester United in 2003, aged 18, winning the FA Cup in his first season. He would also go onto win three consecutive Premier League titles, the Champions League and the FIFA Club World Cup; at age 23, he won his first Ballon d'Or. Ronaldo was the subject of the then-most expensive association football transfer when he signed for Real Madrid in 2009 in a transfer worth €94 million (£80 million), where he won 15 trophies, including two La Liga titles, two Copa del Rey, and four Champions Leagues, and became the club's all-time top goalscorer. He won back-to-back Ballons d'Or in 2013 and 2014, and again in 2016 and 2017, and was runner-up three times behind Lionel Messi, his perceived career rival. 
In 2018, he signed for Juventus in a transfer worth an initial €100 million (£88 million), the most expensive transfer for an Italian club and the most expensive for a player over 30 years old. He won two Serie A titles, two Supercoppe Italiana, and a Coppa Italia, before returning to United in 2021. Ronaldo made his international debut for Portugal in 2003 at the age of 18 and has since earned over 180 caps, making him Portugal's most-capped player. With more than 100 goals at international level, he is also the nation's all-time top goalscorer. Ronaldo has played in and scored at 11 major tournaments; he scored his first international goal at Euro 2004, where he helped Portugal reach the final. He assumed captaincy of the national team in July 2008. In 2015, Ronaldo was named the best Portuguese player of all time by the Portuguese Football Federation. The following year, he led Portugal to their first major tournament title at Euro 2016, and received the Silver Boot as the second-highest goalscorer of the tournament. He also led them to victory in the inaugural UEFA Nations League in 2019, and later received the Golden Boot as top scorer of Euro 2020.One of the world's most marketable and famous athletes, Ronaldo was ranked the world's highest-paid athlete by Forbes in 2016 and 2017 and the world's most famous athlete by ESPN from 2016 to 2019. Time included him on their list of the 100 most influential people in the world in 2014. 
He is the first footballer and the third sportsman to earn US$1 billion in his career.[8] Cristiano Ronaldo dos Santos Aveiro was born in the São Pedro parish of Funchal, the capital of the Portuguese island of Madeira, and grew up in the nearby parish of Santo António.[9][10] He is the fourth and youngest child of Maria Dolores dos Santos Viveiros da Aveiro, a cook, and José Dinis Aveiro, a municipal gardener and part-time kit man.[11] His great-grandmother on his father's side, Isabel da Piedade, was from the island of São Vicente, Cape Verde.[12] He has one older brother, Hugo, and two older sisters, Elma and Liliana Cátia Katia.[13] His mother revealed that she wanted to abort him due to poverty, his father's alcoholism and having too many children already, but her doctor refused to perform the procedure.[14] Ronaldo grew up in an impoverished Catholic home, sharing a room with all his siblings.[15] As a child, Ronaldo played for Andorinha from 1992 to 1995,[16] where his father was the kit man,[11] and later spent two years with Nacional. In 1997, aged 12, he went on a three-day trial with Sporting CP, who signed him for a fee of £1,500.[17] He subsequently moved from Madeira to Alcochete, near Lisbon, to join Sporting's youth academy.[17] By age 14, Ronaldo believed he had the ability to play semi-professionally and agreed with his mother to cease his education to focus entirely on football.[18] While popular with other students at school, he had been expelled after throwing a chair at his teacher, who he said had disrespected him.[18] One year later, he was diagnosed with tachycardia, a condition that could have forced him to give up playing football.[19] Ronaldo underwent heart surgery where a laser was used to cauterise multiple cardiac pathways into one, altering his resting heart rate.[20] He was discharged from the hospital hours after the procedure and resumed training a few days later.[21] "
}

PUT /term-suggester-debug/_doc/rock
{
  "document": "Dwayne Douglas Johnson (born May 2, 1972), also known by his ring name The Rock,[3] is an American actor, businessman, and former professional wrestler.[6][7] Widely regarded as one of the greatest professional wrestlers of all time,[8][9] he wrestled for WWE for eight years prior to pursuing an acting career. His films have grossed over $3.5 billion in North America and over $10.5 billion worldwide,[10] making him one of the world's highest-grossing and highest-paid actors.[11][12] Johnson played college football at the University of Miami and won a national championship in 1991. He aspired to a professional career in football, but went undrafted in the 1995 NFL Draft. He signed with the Calgary Stampeders of the Canadian Football League (CFL), but was cut from the team in his first season.[13] Part of the Anoa'i family, Johnson's father Rocky and maternal grandfather Peter Maivia were professional wrestlers, and he secured a contract with the World Wrestling Federation (WWF, now WWE) in 1996.[2] He rose to prominence after developing the gimmick of a charismatic trash-talker and helped usher in the Attitude Era, an industry boom period in the late 1990s and early 2000s.[14] Johnson left WWE in 2004 and returned in 2011 as a part-time performer until 2013, making sporadic appearances until retiring in 2019.[15] A 10-time world champion, including the promotion's first of African-American descent,[16] he is also a two-time Intercontinental Champion, a five-time Tag Team Champion, the 2000 Royal Rumble winner, and WWE's sixth Triple Crown champion. Johnson headlined the most-bought professional wrestling pay-per-view (WrestleMania XXVIII) and was featured among the most watched episodes of WWE's flagship television series (Raw and SmackDown).[17][18] Johnson's first leading role was as the titular character in the sword and sorcery film The Scorpion King (2002). 
He has since starred in the comedies The Game Plan (2007), Tooth Fairy (2010), and Central Intelligence (2016); the action-adventure films Journey 2: The Mysterious Island (2012), G.I. Joe: Retaliation (2013), Hercules (2014), and Skyscraper (2018); the science-fiction films San Andreas (2015) and Rampage (2018), and the animated film Moana (2016). His role as Luke Hobbs in the Fast & Furious films, beginning with Fast Five (2011), has helped it become one of the highest-grossing film franchises.[19] Johnson also stars in the Jumanji films, appearing in Jumanji: Welcome to the Jungle (2017) and Jumanji: The Next Level (2019), and is set to portray Black Adam in its superhero film adaptation.Johnson produced and starred in the HBO comedy-drama series Ballers (2015–2019)[20] and stars and produces the autobiographical sitcom Young Rock (2021). In 2000, Johnson released the autobiography The Rock Says, which was a New York Times bestseller.[21][22] In 2012, he co-founded the entertainment production company Seven Bucks Productions[23] and is the co-owner of American football league, the XFL.[24][25] In 2016 and 2019, Johnson was named by Time one of the world's most influential people.[26][27] Johnson was born in Hayward, California[28] on May 2, 1972,[29] the son of Ata Johnson (née Maivia; born 1948)[30] and former professional wrestler Rocky Johnson (born Wayde Douglas Bowles; 1944–2020).[31][32] Growing up, Johnson lived briefly in Grey Lynn in Auckland with his mother's family,[33] where he played rugby[34] and attended Richmond Road Primary School before returning to the U.S.[33] Johnson's father was a Black Nova Scotian with a small amount of Irish ancestry.[35][36] His mother is Samoan. 
His father and tag team partner Tony Atlas were the first black tag team champions in WWE history.[37][38] His mother is the adopted daughter of Peter Maivia, who was also a pro wrestler.[39] Johnson's maternal grandmother Lia was the first female pro wrestling promoter, taking over Polynesian Pacific Pro Wrestling after her husband's death in 1982 and managing it until 1988.[40][41] Through his maternal grandfather Maivia, Johnson is a non-blood relative to the Anoa'i wrestling family.[42][43][44][45][46] In 2008, Johnson inducted his father and grandfather into the WWE Hall of Fame.[47] "
}

PUT /term-suggester-debug/_doc/bezos
{
  "document": "Jeffrey Preston Bezos (/ˈbeɪzoʊs/ BAY-zohss;[1] né Jorgensen; born January 12, 1964) is an American entrepreneur, media proprietor, investor, computer engineer, and commercial astronaut.[2][3] He is the founder, executive chairman and former president and CEO of Amazon. With a net worth of around US$146 billion as of June 2022, Bezos is the second-wealthiest person in the world and was the wealthiest from 2017 to 2021 according to both Bloomberg's Billionaires Index and Forbes.[4][5] Born in Albuquerque and raised in Houston and Miami, Bezos graduated from Princeton University in 1986. He holds a degree in electrical engineering and computer science. He worked on Wall Street in a variety of related fields from 1986 to early 1994. Bezos founded Amazon in late 1994, on a road trip from New York City to Seattle. The company began as an online bookstore and has since expanded to a variety of other e-commerce products and services, including video and audio streaming, cloud computing, and artificial intelligence. It is currently the world's largest online sales company, the largest Internet company by revenue, and the largest provider of virtual assistants and cloud infrastructure services through its Amazon Web Services branch.Bezos founded the aerospace manufacturer and sub-orbital spaceflight services company Blue Origin in 2000. Blue Origin's New Shepard vehicle reached space in 2015, and afterwards successfully landed back on Earth. He also purchased the major American newspaper The Washington Post in 2013 for $250 million, and manages many other investments through his venture capital firm, Bezos Expeditions. 
In September 2021, Bezos co-founded biotechnology company Altos Labs with Mail.ru founder Yuri Milner.[6] The first centibillionaire on the Forbes wealth index,[7] Bezos was named the richest man in modern history after his net worth increased to $150 billion in July 2018.[8] In August 2020, according to Forbes, he had a net worth exceeding $200 billion.[9] In 2020 during the COVID-19 pandemic, his wealth grew by approximately $24 billion.[10] On July 5, 2021, Bezos stepped down as the CEO of Amazon and transitioned into the role of executive chairman; Andy Jassy, the chief of Amazon's cloud computing division,[11][12] replaced Bezos as the CEO of Amazon. On July 20, 2021, he flew to space alongside his brother Mark.[13] The suborbital flight lasted over 10 minutes, reaching a peak altitude of 66.5 miles (107.0 km).[14] Jeffrey Preston Jorgensen was born in Albuquerque, New Mexico, on January 12, 1964,[15] the son of Jacklyn (née Gise) and Theodore Jørgensen.[16] At the time of Jeffrey's birth, his mother was a 17-year-old high school student and his father was 19 years old.[17] Theodore Jorgensen had ancestry from Denmark and was born in Chicago to a family of Baptists.[18] After completing high school despite challenging conditions, Jacklyn attended night school while bringing Jeffrey along as a baby.[19] After his parents divorced, his mother married Cuban immigrant Miguel Mike Bezos in April 1968.[20] Shortly after the wedding, Mike adopted four-year-old Jeffrey, whose surname was then legally changed from Jorgensen to Bezos.[21] After Mike had received his degree from the University of New Mexico, the family moved to Houston, Texas, so that he could begin working as an engineer for Exxon.[22] Jeff Bezos attended a Montessori school in Albuquerque, New Mexico, when he was two years old.[23] Jeff Bezos attended River Oaks Elementary School in Houston from fourth to sixth grade.[24] Bezos' maternal grandfather was Lawrence Preston Gise, a regional director of the 
U.S. Atomic Energy Commission (AEC) in Albuquerque.[25] Gise retired early to his family's ranch near Cotulla, Texas, where Bezos would spend many summers in his youth.[26] Bezos would later purchase this ranch and expand it from 25,000 acres (10,117 ha) to 300,000 acres (121,406 ha).[27][28] Bezos displayed scientific interests and technological proficiency, and once rigged an electric alarm to keep his younger siblings out of his room.[29][30] The family moved to Miami, Florida, where Bezos attended Miami Palmetto High School.[31][32] While Bezos was in high school, he worked at McDonald's as a short-order line cook during the breakfast shift.[33] Bezos attended the Student Science Training Program at the University of Florida. He was high school valedictorian, a National Merit Scholar,[34][35] and a Silver Knight Award winner in 1982.[34] In his graduation speech, Bezos told the audience he dreamed of the day when mankind would colonize space. A local newspaper quoted his intention to get all people off the earth and see it turned into a huge national park.[36] In 1986, he graduated summa cum laude from Princeton University with a 4.2 GPA and a Bachelor of Science in Engineering degree (B.S.E.) in electrical engineering and computer science; he was also a member of Phi Beta Kappa.[37][38] While at Princeton, Bezos was a member of the Quadrangle Club, one of Princeton's 11 eating clubs.[39] In addition, he was elected to Tau Beta Pi and was the president of the Princeton chapter of the Students for the Exploration and Development of Space (SEDS).[40][41] "
}

PUT /term-suggester-debug/_doc/robertjr
{
  "document": "Robert John Downey Jr. (born April 4, 1965)[1] is an American actor and producer. His career has been characterized by critical and popular success in his youth, followed by a period of substance abuse and legal troubles, before a resurgence of commercial success later in his career. In 2008, Downey was named by Time magazine among the 100 most influential people in the world,[2][3] and from 2013 to 2015, he was listed by Forbes as Hollywood's highest-paid actor.[2][3] At the age of 5, he made his acting debut in his father's film Pound in 1970. He subsequently worked with the Brat Pack in the teen films Weird Science (1985) and Less Than Zero (1987). In 1992, Downey portrayed the title character in the biopic Chaplin, for which he was nominated for the Academy Award for Best Actor and won a BAFTA Award. Following a stint at the Corcoran Substance Abuse Treatment Facility on drug charges, he joined the TV series Ally McBeal, for which he won a Golden Globe Award. He was fired from the show in the wake of drug charges in 2000 and 2001. He stayed in a court-ordered drug treatment program and has maintained his sobriety since 2003.Initially, completion bond companies would not insure Downey, until Mel Gibson paid the insurance bond for the 2003 film The Singing Detective.[4] He went on to star in the black comedy Kiss Kiss Bang Bang (2005), the thriller Zodiac (2007), and the action comedy Tropic Thunder (2008); for the latter he was nominated for an Academy Award for Best Supporting Actor. Downey gained global recognition for starring as Tony Stark / Iron Man in ten films within the Marvel Cinematic Universe, beginning with Iron Man (2008). He has also played the title character in Guy Ritchie's Sherlock Holmes (2009), which earned him his second Golden Globe, and its sequel, Sherlock Holmes: A Game of Shadows (2011). Robert John Downey Jr. (born April 4, 1965)[1] is an American actor and producer. 
His career has been characterized by critical and popular success in his youth, followed by a period of substance abuse and legal troubles, before a resurgence of commercial success later in his career. In 2008, Downey was named by Time magazine among the 100 most influential people in the world,[2][3] and from 2013 to 2015, he was listed by Forbes as Hollywood's highest-paid actor.[2][3] At the age of 5, he made his acting debut in his father's film Pound in 1970. He subsequently worked with the Brat Pack in the teen films Weird Science (1985) and Less Than Zero (1987). In 1992, Downey portrayed the title character in the biopic Chaplin, for which he was nominated for the Academy Award for Best Actor and won a BAFTA Award. Following a stint at the Corcoran Substance Abuse Treatment Facility on drug charges, he joined the TV series Ally McBeal, for which he won a Golden Globe Award. He was fired from the show in the wake of drug charges in 2000 and 2001. He stayed in a court-ordered drug treatment program and has maintained his sobriety since 2003.Initially, completion bond companies would not insure Downey, until Mel Gibson paid the insurance bond for the 2003 film The Singing Detective.[4] He went on to star in the black comedy Kiss Kiss Bang Bang (2005), the thriller Zodiac (2007), and the action comedy Tropic Thunder (2008); for the latter he was nominated for an Academy Award for Best Supporting Actor. Downey gained global recognition for starring as Tony Stark / Iron Man in ten films within the Marvel Cinematic Universe, beginning with Iron Man (2008). He has also played the title character in Guy Ritchie's Sherlock Holmes (2009), which earned him his second Golden Globe, and its sequel, Sherlock Holmes: A Game of Shadows (2011)."
}

Step 3: Make a TermSuggester call and note that actor is incorrectly corrected to after, etc. We tried different values for max_term_freq, such as 10, 0, 0.01, and 0.0001, and the same problem persists.

GET /term-suggester-debug/_search
{
  "suggest": {
    "querySuggest": {
      "text": "actor",
      "term": {
        "field": "document",
        "max_edits": 2,
        "string_distance": "INTERNAL",
        "suggest_mode": "MISSING",
        "accuracy": 0.6,
        "sort": "SCORE",
        "max_term_freq": 0.0001
      }
    }
  }
}

Expected behavior

According to the documentation (https://opensearch.org/docs/latest/search-plugins/searching-data/did-you-mean/#term-suggester), we expect max_term_freq to control the point at which a term is considered a valid word rather than a typo to be corrected. For example, given that actor appears multiple times, if max_term_freq is set to 1, actor should be treated as a valid word because it appears more than once, and it should not be corrected. However, max_term_freq doesn't seem to affect the results, and the suggestion is obviously wrong.

Plugins N/A

Screenshots

termsuggester

Host/Environment (please complete the following information):

Additional context Related issue: #4529 CC: @macohen @msfroh @noCharger

dblock commented 1 year ago

From the description of the issue, I assume this works reliably on a 1-node setup?

You're using ES 7.9, so first we should figure out whether this was fixed by ES 7.10, and/or whether it's still broken in OpenSearch 2.x. Want to try and narrow it down with newer versions of the software?

If you're on AWS, do open a ticket with support; they might know of another case like this.

msfroh commented 1 year ago

This is definitely still applicable in recent releases of OpenSearch.

The issue is that the doc frequency for a term is evaluated per shard at the Lucene level, and suggestions are returned if the term's frequency on that shard is below the threshold. I'm pretty sure it doesn't depend on the node count.
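To make the per-shard behavior concrete, here is a rough Python model of the check each shard performs in isolation. It is a simplified sketch, assumed to follow the frequency checks in Lucene's DirectSpellChecker, and is not copied from the OpenSearch source. The function name and the shard layout (one document per shard) are hypothetical; real routing hashes the document ID and may place several documents on the same shard.

```python
import math


def shard_tries_to_suggest(doc_freq, max_doc, suggest_mode="missing", max_term_freq=0.01):
    """Return True if a single shard would look for corrections of a term.

    doc_freq: number of documents on THIS shard containing the term.
    max_doc:  number of documents on THIS shard.
    Simplified model (assumption), not the actual OpenSearch implementation.
    """
    # "missing" mode: only suggest when the term is absent from this shard.
    if suggest_mode == "missing" and doc_freq > 0:
        return False
    # Values >= 1 are treated as an absolute document count.
    if max_term_freq >= 1:
        return doc_freq <= max_term_freq
    # Values < 1 are treated as a fraction of the shard's documents.
    return doc_freq <= math.ceil(max_term_freq * max_doc)


# Hypothetical layout: each document sits alone on its own shard.
# Tuples are (document on the shard, docs on the shard, doc frequency of "actor").
shard_stats = [
    ("obama", 1, 0),
    ("ronaldo", 1, 0),
    ("rock", 1, 1),
    ("bezos", 1, 0),
    ("robertjr", 1, 1),
]

suggesting = [
    name
    for name, max_doc, doc_freq in shard_stats
    if shard_tries_to_suggest(doc_freq, max_doc, "missing", 0.0001)
]
print(suggesting)
```

Under this assumed layout, the shards that never saw actor are the ones that go looking for corrections, which would be consistent with suggestions such as after and altos coming from the obama and bezos documents. No value of max_term_freq changes that, because each shard only sees its own frequency.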

I personally like @noCharger's approach 2 in https://github.com/opensearch-project/OpenSearch/issues/8174 to address this. The suggestions can come back from the shards, along with the evaluated term's frequency. If the sum of a term's frequencies across all the shards exceeds the new threshold parameter, then we don't offer suggestions (for that term).
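A minimal Python sketch of how that coordinator-side check could look, under the assumption that each shard returns the evaluated term's local frequency next to its suggestions. The class, function, and parameter names are all hypothetical and only illustrate the idea; they are not an existing OpenSearch API.

```python
from dataclasses import dataclass, field


@dataclass
class ShardSuggestResponse:
    """What one shard would send back (hypothetical shape)."""
    term_freq: int                      # frequency of the queried term on this shard
    suggestions: list = field(default_factory=list)


def merge_suggestions(responses, global_threshold):
    """Merge per-shard suggestions, dropping them if the term is common overall."""
    total_freq = sum(r.term_freq for r in responses)
    # The term is frequent enough across the whole index: treat it as valid.
    if total_freq > global_threshold:
        return []
    # Otherwise keep the union of shard suggestions, preserving first-seen order.
    merged = []
    for response in responses:
        for suggestion in response.suggestions:
            if suggestion not in merged:
                merged.append(suggestion)
    return merged


# "actor" is missing on two shards (which suggest) and present on two others.
responses = [
    ShardSuggestResponse(term_freq=0, suggestions=["after", "action"]),
    ShardSuggestResponse(term_freq=1),
    ShardSuggestResponse(term_freq=1),
    ShardSuggestResponse(term_freq=0, suggestions=["altos"]),
]

print(merge_suggestions(responses, global_threshold=1))
print(merge_suggestions(responses, global_threshold=5))
```

With a threshold of 1, the summed frequency of 2 suppresses every suggestion for actor; with a threshold of 5, the shard suggestions pass through unchanged. Real merging would also need to combine scores and frequencies of the suggestions themselves, which this sketch leaves out.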