nvandeweerd / fsca

fsca is an open-source R package for the extraction of syntactic units from dependency-parsed French texts. Please refer to the README file for licensing information.

FSCA with other parser/taggers #1

Open gloignon opened 1 year ago

gloignon commented 1 year ago

Hello professor Vandeweerd,

I've been working on a pipeline for French analysis for a while now. It currently uses rsyntax to extract noun and verb phrases but I've been looking to replace rsyntax with FSCA.

Are there plans to adapt the FSCA package for more commonly used parsers/taggers in the future (e.g. udpipe or spaCy)?

Regards, GL

nvandeweerd commented 1 year ago

Hi Guillaume,

Thanks for your interest in fsca. I don’t have any immediate plans to adapt the script to work on UD/spaCy output for the time being, but this is definitely something that would be worthwhile in the future. That said, it shouldn’t require too much work: the script already accepts CoNLL data frames, so the main thing that would need to be adjusted is the POS tags within the script itself.
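
For illustration, here is a rough, untested sketch of how you could get a CoNLL-U style data frame out of udpipe; the "french-gsd" model name is just an example, and fsca's expected column names and tagset would still need to be checked:

```r
# Untested sketch: produce a CoNLL-U style data frame with udpipe.
# The "french-gsd" model is an assumption; any French UD model should work.
library(udpipe)

model_info <- udpipe_download_model(language = "french-gsd")   # downloads once
ud_model   <- udpipe_load_model(model_info$file_model)

txt  <- "Le petit chat dort sur le canapé."
anno <- as.data.frame(udpipe_annotate(ud_model, x = txt))

# Standard CoNLL-U columns: token_id, token, lemma, upos, xpos,
# head_token_id, dep_rel, ... The UD tags/relations would still need
# to be mapped to the tagset that fsca expects internally.
anno[, c("token_id", "token", "lemma", "upos", "dep_rel", "head_token_id")]
```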

Also, are you aware that spaCy can now extract noun phrases directly? https://spacy.io/usage/linguistic-features#noun-chunks This may help to simplify your pipeline. Although there is no direct function to extract verb phrases, it should be possible to extract those relatively easily as well by navigating the parse tree. I haven’t tested this for French, however, so I can’t speak to its reliability.
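
If you are working from R, something along these lines (untested for French; the model name is an assumption) would give you spaCy's noun chunks via the spacyr package:

```r
# Untested sketch: spaCy noun chunks from R via spacyr.
# The model "fr_core_news_sm" is an assumption; it must be installed in the
# Python environment that spacyr uses.
library(spacyr)

spacy_initialize(model = "fr_core_news_sm")

txt <- "Le petit chat dort sur le canapé du salon."
np  <- spacy_extract_nounphrases(txt)   # one row per noun chunk
np

spacy_finalize()
```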

I’m curious to hear about the pipeline that you are developing. Do you happen to have any more information on that?

Kind regards,

Nathan Vandeweerd, PhD (he/him)
Assistant Professor | Department of Language and Communication | Radboud University Nijmegen
Erasmus Building E 4.08, Erasmusplein 1
Website: https://www.ru.nl/personen/vandeweerd-n | LinkedIn: https://www.linkedin.com/in/nathanvandeweerd/ | ResearchGate: https://www.researchgate.net/profile/Nathan-Vandeweerd | GitHub: https://github.com/nvandeweerd | Twitter: https://twitter.com/nvandew


gloignon commented 1 year ago

Thank you, I will look into the NP extractor in spaCy. The French version of spaCy used to perform quite poorly for POS tagging (hence my choice of udpipe), but I could always run both parsers in parallel and then pick the features I need. Since I also want T-units and verb phrases, your suggestion of adapting the tags in fsca makes a lot of sense too.

The ALSI pipeline (Analyseur lexico-syntaxique intégré, "integrated lexico-syntactic analyzer") was part of my PhD project. The resulting paper is available at https://doi.org/10.7202/1093065ar. I am cleaning and commenting the code and plan to host ALSI on GitHub later this year.