scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Simplify project's emoji keyword functionality #359

Open andrewtavis opened 1 week ago

andrewtavis commented 1 week ago

Terms

Description

Something that should be changed about the project is the way that the emoji keyword functionality works. Basically all of the files in question are the same except for a few variables, and there are already CLI arguments being passed to these files. We do want the structure of the package to determine the functionality of the project, but this is a case where there's really no benefit to repetition, unlike the queries, where each file serves as a record of how to get data from Wikidata via SPARQL.

Some ideas:

Contribution

Thoughts on this would be very appreciated! Happy to review and work with people on this 😊

andrewtavis commented 1 week ago

CC @DeleMike, @catreedle, @KesharwaniArpita and @VNW22 for the initial discussion 😊

KesharwaniArpita commented 1 week ago

Sounds good to me. The emoji keyword functionality is essentially the same for all languages, so centralizing it makes sense. @andrewtavis, can I be assigned?

DeleMike commented 1 week ago

This is a great issue, @andrewtavis! As things stand, I imagine that whenever we want to update something related to emojis we would have to go through all the directories!

As you suggested, having a single point of call for the emojis is important. I went through the files, and I believe gen_emoji_lexicon in src/scribe_data/unicode/process_unicode.py will be an important function.

This is just a shallow thought for now. I would love to contribute in any way to resolve this issue.

andrewtavis commented 1 week ago

Let's leave this issue for a bit and continue the conversation on it. We all have a lot being worked on right now, so let's close some current issues and then we can plan the work from there :)

Thanks for your interest in helping!

catreedle commented 1 week ago

I agree with centralizing the functionality, it’ll certainly save a lot of repetitive work! I'm curious, will the emoji generation remain uniform across languages, or is there a plan to account for different linguistic forms, such as gender variations?

andrewtavis commented 1 week ago

Gendered emojis should be coming out in the current setup as it's Unicode's words that are associated with a given emoji ordered by their usage and then the top X - usually 3 - are selected :) We can keep this in mind for later though 😊
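
To make the "top X" selection concrete, here is a minimal sketch of the idea, not the project's actual implementation. The function name, the `(keyword, emoji, rank)` tuple layout and the sample data are all assumptions made up for illustration; the only thing taken from the comment above is that emojis are ordered by usage and the top few (usually 3) are kept per word.

```python
from collections import defaultdict


def top_emojis_per_keyword(annotations, emojis_per_keyword=3):
    """Keep the most used emojis for each keyword.

    `annotations` is an iterable of (keyword, emoji, rank) tuples, where a
    lower rank means the emoji is used more often. This layout is a
    hypothetical stand-in for whatever the CLDR processing step produces.
    """
    by_keyword = defaultdict(list)
    for keyword, emoji, rank in annotations:
        by_keyword[keyword.lower()].append((rank, emoji))

    # Sort each keyword's candidates by usage rank and keep the first few.
    return {
        keyword: [emoji for _, emoji in sorted(candidates)[:emojis_per_keyword]]
        for keyword, candidates in by_keyword.items()
    }


# Made-up sample data: four candidates for "smile", one for "heart".
sample = [
    ("smile", "😺", 40),
    ("smile", "😀", 1),
    ("smile", "🙂", 12),
    ("smile", "😊", 5),
    ("heart", "💙", 2),
]
result = top_emojis_per_keyword(sample)
print(result)
```

Because the selection works on whatever words Unicode associates with an emoji, gendered word forms would pass through this step like any other keyword.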

catreedle commented 1 week ago

I see. Then I think there shouldn't be any issue with centralizing it :)

Ekikereabasi-Nk commented 1 week ago

Hi @andrewtavis @catreedle @KesharwaniArpita @VNW22 @DeleMike I'm interested in joining the team

catreedle commented 1 week ago

Welcome! I think we're still very much in the discussion phase. Looking forward to collaborating with you! :)

Ekikereabasi-Nk commented 1 week ago

Thank you so much. Has the discussion started? Is the discussion on the element app?

andrewtavis commented 1 week ago

I was going to suggest this to you, @Ekikereabasi-Nk :) We're discussing it in the issue right now. Can you look at the emoji keyword functionality and make a suggestion on how to centralize this functionality into the src/scribe_data/unicode directory? :)

Ekikereabasi-Nk commented 1 week ago

To achieve centralized functionality, I suggest the following steps:

So, how do you all see this suggestion? @andrewtavis

KesharwaniArpita commented 1 week ago

Next, we will need to modify the emoji language file for each language to import the centralized function from step 1 and simplify the code

Hi @andrewtavis, @DeleMike, @catreedle, @Ekikereabasi-Nk, do you think we should modify the __init__.py files to import and call the centralized function, passing in the appropriate variables as arguments (e.g., language and emoji-specific variables)? It would be able to cater to the grouped languages (SA Hindustani, Norwegian, etc.) too, along with any other required customization.
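
As a rough illustration of what "call the centralized function with the language as an argument" could look like, here is a hedged sketch. Every name in it (`generate_emoji_keywords`, `language_path`, `LANGUAGE_GROUPS`, `fake_build_lexicon`, the CLI flags) is hypothetical and not taken from the Scribe-Data codebase. The `build_lexicon` parameter is a stand-in for whatever shared function does the real work, so nothing is assumed about the signature of the existing `gen_emoji_lexicon`.

```python
import argparse
from pathlib import PurePosixPath

# Illustrative mapping for grouped languages: sub-language -> parent directory.
LANGUAGE_GROUPS = {
    "Hindi": "Hindustani",
    "Urdu": "Hindustani",
    "Bokmål": "Norwegian",
    "Nynorsk": "Norwegian",
}


def language_path(language):
    """Return the relative directory for a language, nested if it is grouped."""
    parent = LANGUAGE_GROUPS.get(language)
    return PurePosixPath(parent, language) if parent else PurePosixPath(language)


def generate_emoji_keywords(language, build_lexicon, emojis_per_keyword=3):
    """Run the shared steps once, with the language passed in as data."""
    keywords = build_lexicon(language, emojis_per_keyword)
    return {"language": language, "path": str(language_path(language)), "keywords": keywords}


def parse_args(argv=None):
    """Parse the CLI arguments that the per-language files currently duplicate."""
    parser = argparse.ArgumentParser(description="Generate emoji keywords for one language.")
    parser.add_argument("--language", required=True)
    parser.add_argument("--emojis-per-keyword", type=int, default=3)
    return parser.parse_args(argv)


def fake_build_lexicon(language, emojis_per_keyword):
    """Placeholder for the real lexicon builder; returns made-up data."""
    return {"example": ["🙂", "😀", "😊", "😺"][:emojis_per_keyword]}


# Demonstration call with explicit arguments instead of reading sys.argv.
args = parse_args(["--language", "Hindi", "--emojis-per-keyword", "2"])
result = generate_emoji_keywords(args.language, fake_build_lexicon, args.emojis_per_keyword)
print(result)
```

With something along these lines, the per-language differences live in arguments and a small mapping, and where the call is made from (a CLI command, a script, or elsewhere) remains an open design question for the thread.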

andrewtavis commented 1 week ago

I'm generally thinking that we follow @Ekikereabasi-Nk's suggestions here and maybe keep the empty __init__.py files as a means of keeping the functionality from the project structure, but the full process is done in the unicode directory. I'm actually not sure what languages Unicode has support for, so maybe that's something that we could explore a bit, i.e. what languages are included in the CLDR dataset. There's no better source of this information, and with this we'd know to just put an __init__.py file in the directories for those languages that we find have emoji support. What's more, another check could be written to find which languages do have support and make sure that each of them and only them have an __init__.py file :)
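
The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `<Language>/emoji_keywords/__init__.py` layout, the function name `check_init_files` and the set of "supported" languages are all made up here, and in practice the supported set would come from inspecting the CLDR data rather than being hard-coded. Grouped languages with an extra directory level are not handled in this sketch.

```python
import tempfile
from pathlib import Path


def check_init_files(language_root, supported):
    """Compare languages that have an emoji __init__.py with the supported set.

    Returns (missing, unexpected): supported languages without an
    __init__.py, and languages that have one but are not supported.
    """
    has_init = {
        init_file.parent.parent.name
        for init_file in Path(language_root).glob("*/emoji_keywords/__init__.py")
    }
    missing = sorted(supported - has_init)
    unexpected = sorted(has_init - supported)
    return missing, unexpected


# Build a throwaway directory tree so the sketch can run on its own.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for language in ("English", "Klingon"):
        package = root / language / "emoji_keywords"
        package.mkdir(parents=True)
        (package / "__init__.py").touch()
    # French has the directory but no __init__.py file.
    (root / "French" / "emoji_keywords").mkdir(parents=True)

    # Made-up supported set; "Klingon" is deliberately left out.
    missing, unexpected = check_init_files(root, {"English", "French"})

print("missing:", missing)
print("unexpected:", unexpected)
```

A check like this could run in CI so that the directory structure and the actual emoji support cannot drift apart unnoticed.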

andrewtavis commented 1 week ago

A basic thing is that the __init__.py files should remain empty as this is Python packaging convention. They should make it easier to load something with a different name or do nothing, as I understand it.

KesharwaniArpita commented 1 week ago

Thanks for the feedback! I agree with the idea of keeping the __init__.py files as a Python packaging convention, especially to maintain the project's structure and potentially assist with language-specific functionality loading.

Regarding the suggestion of using the CLDR dataset to check which languages have emoji support, that sounds like a great idea. It will ensure we're only including relevant languages in the directories.

VNW22 commented 1 week ago

heyy, I'm kinda late but i'd like to join in the discussion :)

andrewtavis commented 1 week ago

By all means, @VNW22! Let's try to get to this soon :) @Ekikereabasi-Nk, do you want to open a PR for this and the others can review?

VNW22 commented 1 week ago

I fully support the plan to centralize the emoji-keyword functionality by moving the shared logic to src/scribe_data/unicode—this will streamline the process and reduce redundancy. It seems like a solid solution has emerged from the discussion so far, but I’d be happy to assist with any part of the refactoring or the exploration of the CLDR dataset to ensure we cover all relevant languages.

andrewtavis commented 1 week ago

Do you want to look into the script to check that we have emoji support for all languages that we can and don't for those that we shouldn't, @VNW22? You'd need to do the setup for CLDR, which is difficult to do on Windows (if that's your operating system, then you'll likely need WSL to run the emoji programs on a Linux machine).

Let us know!

Ekikereabasi-Nk commented 1 week ago

Alright @andrewtavis

KesharwaniArpita commented 6 days ago

@Ekikereabasi-Nk and @andrewtavis , I wanted to rewrite the code for the language emoji files. I think we can start collaborating on the code. While @Ekikereabasi-Nk is working on the centralized script, is it alright that I start working on the function call for the languages? We can make the minor changes later too?

Ekikereabasi-Nk commented 6 days ago

Sure, @KesharwaniArpita. I'm also through with the centralized function.

andrewtavis commented 6 days ago

Feel free to send along PRs and we'll see on both ends :)

VNW22 commented 6 days ago

is it possible on mac?

VNW22 commented 6 days ago

okay, I'll be working on it

andrewtavis commented 6 days ago

Sorry I was planning on sending along an explanation here, @VNW22, but got caught up with things :)

It's much easier on Mac and Linux. Specifically, we have a guide for this here. Let me know if anything is confusing and we can update the guide!

Thanks for looking into this 😊

Ekikereabasi-Nk commented 5 days ago

Thanks @KesharwaniArpita for the work here https://github.com/scribe-org/Scribe-Data/pull/397

VNW22 commented 5 days ago

no worries :) looking into it