shtoshni / fast-coref

Code for the CRAC 2021 paper "On Generalization in Coreference Resolution" (Best short paper award)

Cluster index does not align with output['tokenized_doc']['orig_tokens'] #10

Closed mithunb closed 2 years ago

mithunb commented 2 years ago

Hi, I am using the Colab notebook you provided. Here is the sample text I passed in:

doc = """Elon Reeve Musk FRS (born June 28, 1971) is a business magnate and investor. He is the founder, CEO, and Chief Engineer at SpaceX; angel investor, CEO, and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$203 billion as of June 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and Forbes' real-time billionaires list.[5][6]

Musk was born to White South African parents in Pretoria, where he grew up. He briefly attended the University of Pretoria before moving to Canada at age 17, acquiring citizenship through his Canadian-born mother. He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received bachelor's degrees in Economics and Physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding the web software company Zip2 with his brother Kimbal. The startup was acquired by Compaq for $307 million in 1999. The same year, Musk co-founded online bank X.com, which merged with Confinity in 2000 to form PayPal. The company was bought by eBay in 2002 for $1.5 billion.

In 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he serves as CEO and Chief Engineer. In 2004, he was an early investor in electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.). He became its chairman and product architect, eventually assuming the position of CEO in 2008. In 2006, he helped create SolarCity, a solar energy company that was later acquired by Tesla and became Tesla Energy. In 2015, he co-founded OpenAI, a nonprofit research company promoting friendly artificial intelligence (AI). In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain–computer interfaces, and founded The Boring Company, a tunnel construction company. He agreed to purchase the major American social networking service Twitter in 2022 for $44 billion. Musk has proposed the Hyperloop, a high-speed vactrain transportation system, and is the president of the Musk Foundation, an organization which donates to scientific research and education.

Musk has been criticized for making unscientific and controversial statements, such as spreading misinformation about the COVID-19 pandemic. In 2018, he was sued by the US Securities and Exchange Commission (SEC) for falsely tweeting that he had secured funding for a private takeover of Tesla; he settled with the SEC but did not admit guilt, and he temporarily stepped down from his Tesla chairmanship. In 2019, he won a defamation case brought against him by a British caver who had advised in the Tham Luang cave rescue """

I get the following non-singleton clusters:
[((0, 12), 'Elon Reeve Musk FRS ( born June 28 , 1971 )'), ((21, 21), 'He'), ((84, 84), 'Musk'), ((115, 115), 'Musk'), ((128, 128), 'he'), ((132, 132), 'He'), ((151, 151), 'his'), ((157, 157), 'He'), ((178, 178), 'he'), ((189, 189), 'He'), ((218, 218), 'his'), ((241, 241), 'Musk'), ((282, 282), 'Musk'), ((297, 297), 'he'), ((308, 308), 'he'), ((330, 330), 'He'), ((350, 350), 'he'), ((374, 374), 'he'), ((396, 396), 'he'), ((427, 427), 'He'), ((445, 445), 'Musk'), ((484, 484), 'Musk'), ((513, 513), 'he'), ((530, 530), 'he'), ((541, 541), 'he'), ((553, 553), 'he'), ((558, 558), 'his'), ((566, 566), 'he'), ((573, 573), 'him')]
[((32, 32), 'SpaceX'), ((284, 303), 'SpaceX , an aerospace manufacturer and space transport services company , of which he serves as CEO and Chief Engineer')]
[((211, 216), 'the web software company Zip2'), ((223, 224), 'The startup')]
[((235, 235), '1999'), ((237, 239), 'The same year')]
[((245, 260), 'online bank X.com , which merged with Confinity in 2000 to form PayPal'), ((262, 263), 'The company')]
[((269, 269), '2002'), ((280, 280), '2002')]
[((314, 328), 'electric vehicle manufacturer Tesla Motors , Inc. ( now Tesla , Inc. )'), ((332, 332), 'its'), ((365, 365), 'Tesla'), ((539, 539), 'Tesla'), ((559, 559), 'Tesla')]
[((517, 525), 'the US Securities and Exchange Commission ( SEC )'), ((544, 545), 'the SEC')]

The tokenized list is this: ['Elon', 'Reeve', 'Musk', 'FRS', '(', 'born', 'June', '28', ',', '1971', ')', 'is', 'a', 'business', 'magnate', 'and', 'investor', '.', 'He', 'is', 'the', 'founder', ',', 'CEO', ',', 'and', 'Chief', 'Engineer', 'at', 'SpaceX', ';', 'angel', 'investor', ',', 'CEO', ',', 'and', 'Product', 'Architect', 'of', 'Tesla', ',', 'Inc.', ';', 'founder', 'of', 'The', 'Boring', 'Company', ';', 'and', 'co', '-', 'founder', 'of', 'Neuralink', 'and', 'OpenAI', '.', 'With', 'an', 'estimated', 'net', 'worth', 'of', 'around', 'US$', '203', 'billion', 'as', 'of', 'June', '2022,[4', ']', 'Musk', 'is', 'the', 'wealthiest', 'person', 'in', 'the', 'world', 'according', 'to', 'both', 'the', 'Bloomberg', 'Billionaires', 'Index', 'and', 'Forbes', "'", 'real', '-', 'time', 'billionaires', 'list.[5][6', ']', '\n\n', 'Musk', 'was', 'born', 'to', 'White', 'South', 'African', 'parents', 'in', 'Pretoria', ',', 'where', 'he', 'grew', 'up', '.', 'He', 'briefly', 'attended', 'the', 'University', 'of', 'Pretoria', 'before', 'moving', 'to', 'Canada', 'at', 'age', '17', ',', 'acquiring', 'citizenship', 'through', 'his', 'Canadian', '-', 'born', 'mother', '.', 'He', 'matriculated', 'at', 'Queen', "'s", 'University', 'and', 'transferred', 'to', 'the', 'University', 'of', 'Pennsylvania', 'two', 'years', 'later', ',', 'where', 'he', 'received', 'bachelor', "'s", 'degrees', 'in', 'Economics', 'and', 'Physics', '.', 'He', 'moved', 'to', 'California', 'in', '1995', 'to', 'attend', 'Stanford', 'University', 'but', 'decided', 'instead', 'to', 'pursue', 'a', 'business', 'career', ',', 'co', '-', 'founding', 'the', 'web', 'software', 'company', 'Zip2', 'with', 'his', 'brother', 'Kimbal', '.', 'The', 'startup', 'was', 'acquired', 'by', 'Compaq', 'for', '$', '307', 'million', 'in', '1999', '.', 'The', 'same', 'year', ',', 'Musk', 'co', '-', 'founded', 'online', 'bank', 'X.com', ',', 'which', 'merged', 'with', 'Confinity', 'in', '2000', 'to', 'form', 'PayPal', '.', 'The', 'company', 'was', 
'bought', 'by', 'eBay', 'in', '2002', 'for', '$', '1.5', 'billion', '.', '\n\n', 'In', '2002', ',', 'Musk', 'founded', 'SpaceX', ',', 'an', 'aerospace', 'manufacturer', 'and', 'space', 'transport', 'services', 'company', ',', 'of', 'which', 'he', 'serves', 'as', 'CEO', 'and', 'Chief', 'Engineer', '.', 'In', '2004', ',', 'he', 'was', 'an', 'early', 'investor', 'in', 'electric', 'vehicle', 'manufacturer', 'Tesla', 'Motors', ',', 'Inc.', '(', 'now', 'Tesla', ',', 'Inc.', ')', '.', 'He', 'became', 'its', 'chairman', 'and', 'product', 'architect', ',', 'eventually', 'assuming', 'the', 'position', 'of', 'CEO', 'in', '2008', '.', 'In', '2006', ',', 'he', 'helped', 'create', 'SolarCity', ',', 'a', 'solar', 'energy', 'company', 'that', 'was', 'later', 'acquired', 'by', 'Tesla', 'and', 'became', 'Tesla', 'Energy', '.', 'In', '2015', ',', 'he', 'co', '-', 'founded', 'OpenAI', ',', 'a', 'nonprofit', 'research', 'company', 'promoting', 'friendly', 'artificial', 'intelligence', '(', 'AI', ')', '.', 'In', '2016', ',', 'he', 'co', '-', 'founded', 'Neuralink', ',', 'a', 'neurotechnology', 'company', 'focused', 'on', 'developing', 'brain', '–', 'computer', 'interfaces', ',', 'and', 'founded', 'The', 'Boring', 'Company', ',', 'a', 'tunnel', 'construction', 'company', '.', 'He', 'agreed', 'to', 'purchase', 'the', 'major', 'American', 'social', 'networking', 'service', 'Twitter', 'in', '2022', 'for', '$', '44', 'billion', '.', 'Musk', 'has', 'proposed', 'the', 'Hyperloop', ',', 'a', 'high', '-', 'speed', 'vactrain', 'transportation', 'system', ',', 'and', 'is', 'the', 'president', 'of', 'the', 'Musk', 'Foundation', ',', 'an', 'organization', 'which', 'donates', 'to', 'scientific', 'research', 'and', 'education', '.', '\n\n', 'Musk', 'has', 'been', 'criticized', 'for', 'making', 'unscientific', 'and', 'controversial', 'statements', ',', 'such', 'as', 'spreading', 'misinformation', 'about', 'the', 'COVID-19', 'pandemic', '.', 'In', '2018', ',', 'he', 'was', 'sued', 'by', 'the', 'US', 
'Securities', 'and', 'Exchange', 'Commission', '(', 'SEC', ')', 'for', 'falsely', 'tweeting', 'that', 'he', 'had', 'secured', 'funding', 'for', 'a', 'private', 'takeover', 'of', 'Tesla', ';', 'he', 'settled', 'with', 'the', 'SEC', 'but', 'did', 'not', 'admit', 'guilt', ',', 'and', 'he', 'temporarily', 'stepped', 'down', 'from', 'his', 'Tesla', 'chairmanship', '.', 'In', '2019', ',', 'he', 'won', 'a', 'defamation', 'case', 'brought', 'against', 'him', 'by', 'a', 'British', 'caver', 'who', 'had', 'advised', 'in', 'the', 'Tham', 'Luang', 'cave', 'rescue', '\n ']

As can be seen, the second item in the first cluster, 'He', is reported at index 21, whereas its index in the orig_tokens list is 18.

Can you please explain why there is this misalignment?

mithunb commented 2 years ago

OK, I found the subtoken_map in the output. It maps each cluster index to the correct position in the orig_tokens list.
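
For anyone hitting the same mismatch, here is a minimal sketch of the remapping on toy data. It assumes that cluster spans are inclusive (start, end) pairs over the model's subtoken sequence, and that subtoken_map is a list whose i-th entry is the orig_tokens index of subtoken i. Both are assumptions about the output format rather than something confirmed by the repository, and the helper name `span_to_orig_tokens` is made up for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def span_to_orig_tokens(
    span: Span, subtoken_map: Sequence[int], orig_tokens: Sequence[str]
) -> Tuple[Span, List[str]]:
    """Map an inclusive subtoken span to an inclusive orig_tokens span.

    Hypothetical helper: assumes subtoken_map[i] is the index in
    orig_tokens of the word that subtoken i belongs to.
    """
    start, end = span
    orig_start = subtoken_map[start]
    orig_end = subtoken_map[end]
    return (orig_start, orig_end), list(orig_tokens[orig_start:orig_end + 1])


# Toy data (not real model output). "Musk" and "investor" are each
# pretended to split into two subtokens, which shifts later indices.
orig_tokens = ["Elon", "Musk", "is", "an", "investor", ".",
               "He", "founded", "SpaceX", "."]
subtoken_map = [0, 1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9]

# One cluster in the same ((start, end), text) shape as shown above.
cluster = [((0, 2), "Elon Musk"), ((8, 8), "He")]

aligned = [
    span_to_orig_tokens(span, subtoken_map, orig_tokens)
    for span, _text in cluster
]
for orig_span, words in aligned:
    print(orig_span, " ".join(words))
```

In this toy case the mention 'He' sits at subtoken index 8 but at index 6 in orig_tokens, which is the same kind of offset as 21 versus 18 in the real output. With real output you would pass in the subtoken_map and orig_tokens found there instead of the toy lists.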