scribe-org / Scribe-iOS

iOS app with keyboards for language learners
https://apps.apple.com/app/scribe-language-keyboards/id1596613886
GNU General Public License v3.0
126 stars 78 forks source link

Convert JSON data to SQLite #96

Closed andrewtavis closed 1 year ago

andrewtavis commented 2 years ago

Terms

Behavior

Thereโ€™s currently a bug with v1.1.0 where the Russian keyboard will not translate words. The keyboard switches to English as it should, and words can be typed, but then on pressing return the keyboard crashes without anything being inserted.

Device type

iPhone 7

iOS version

iOS 15.1

andrewtavis commented 2 years ago

Is not reproducible on an iPhone 7 in Simulator, but then the version is 15.2.

Other translation features work fine, so shouldn't be an issue with accessing the data as the size is relatively similar (although Russian has the largest translation file by 0.6MB).

japanese-goblinn commented 2 years ago

I may take a look on this issue

andrewtavis commented 2 years ago

This would be absolutely amazing! ๐Ÿš€๐Ÿš€ I really haven't had the bandwidth to dedicate time to this, but it's the main bug right now that dramatically affects the functionality. Let me know if you have ideas on this!

japanese-goblinn commented 2 years ago

Yep, assign me on this please and I'll let you know on my progress with this issue

andrewtavis commented 2 years ago

Assigned! Thank you ๐Ÿ˜Š

andrewtavis commented 2 years ago

@japanese-goblinn, reporting that apparently v2.0 Made this even worse, as basically the Russian keyboard crashes from the start at this point ๐Ÿคฆโ€โ™‚๏ธ This is not happening on the emulator, and sadly we donโ€™t have device level testing happening at this point. The crashes that other keyboards were experiencing have actually gotten better.

Iโ€™m wondering if this is maybe an issue with the amount of data that Russianโ€™s ingesting? All the other keyboards have similar amounts of data, but Russian has those almost 200k bound. Could be crashing because it canโ€™t handle it?

japanese-goblinn commented 2 years ago

@andrewtavis I've looked intro this crash and will soon describe here what I've found

andrewtavis commented 2 years ago

Thanks so much, @japanese-goblinn! Really looking forward to what youโ€™ve figured out :)

andrewtavis commented 2 years ago

@japanese-goblinn, note that I just checked and the new "lack of" behavior is reproducible sometimes on the Simulator, I just didn't check Russian after all these changes. Am looking into this as well now :)

japanese-goblinn commented 2 years ago

Well, I managed to get this error while debugging, but not really sure how to reproduce it (screenshots below). I want to take a deeper look into this bug but unfortunately I don't have time for now.

This is what I found so far:

You sad that file is about 0.6 MB when it's actually 15.1 MB. Here's code how I measured it

// func loadJSONToDict
let data = try Data(contentsOf: url)
let f = ByteCountFormatter()
f.allowedUnits = [.useMB]
f.countStyle = .memory  
print("[DEBUG]: file \(fileName) size \(f.string(fromByteCount: Int64(data.count)))")

It's not much but it's a measurement of raw data, so it's less that Dictionary will occupy (not sure how bad it is, you need to figure it out yourself).

I also tried to use profiler but that was not really helpful, it's only showed that a really slow disc read is happening but you can see it yourself - keyboard is taking 2-3 seconds to startup (at least in dev mode).

As a conclusion I'm not sure if slow app startup is causing sometimes to crash, because you can't be absolutely sure if system will kill your application due to slow start. What I'm absolutely sure about is that you need to rework your data structures and whole approach of working with data (read only batches not whole file, using DB, etc.) and I can sure your that application will become event better and hopefully you'll manage to fix this bug!

I hope you luck in this hard task and thank you for possibility to contribute!

IMAGE 2022-10-13 20:14:23 IMAGE 2022-10-13 20:14:26

andrewtavis commented 2 years ago

@japanese-goblinn, wanted to thank you for your contributions! #231 is an initial step towards working on this in my opinion, as we really should be leveraging a library for JSON loading. Hopefully this will make the read only batches you suggested or another solution for the data process easier going forward ๐Ÿ˜Š

All the best, and your contributions are welcome whenever you have time! :)

andrewtavis commented 2 years ago

We'll likely need to be using Core Data for this in the future, but a WIP step to help the situation would be to remove the indentation from all JSON files to make them smaller in size.

andrewtavis commented 2 years ago

@SaurabhJamadagni, FYI for this issue, I have the JSON files formatted to remove spaces, which really does help as far as app size is concerned. Will push those with the QWERTY French keyboard for #229 :)

It sadly doesnโ€™t seem to help to much with load times, as itโ€™s likely the crazy number of lookups being done thatโ€™s problematic. Iโ€™m going to a Berlin Swift coding group on Thursday and will discuss with people how best to implement Core Data or another solution that will help us now directly load all these dictionaries into memory all at once ๐Ÿ˜Š

SaurabhJamadagni commented 2 years ago

It sadly doesnโ€™t seem to help to much with load times, as itโ€™s likely the crazy number of lookups being done thatโ€™s problematic. Iโ€™m going to a Berlin Swift coding group on Thursday and will discuss with people how best to implement Core Data or another solution that will help us now directly load all these dictionaries into memory all at once

That sounds good @andrewtavis! This should be a huge performance boost in that case. I'll try to see if I can find any answers as well ๐Ÿ˜Š

andrewtavis commented 2 years ago

Iโ€™ve already been watching some videos on all this after you suggested it, @SaurabhJamadagni :) Doesnโ€™t seem like itโ€™d be too much of a jump as itโ€™s so well integrated into normal Swift workflows. Letโ€™s be in touch!

andrewtavis commented 1 year ago

A recap of what came from the recent iOS meetup I attended in relation to this issue, @SaurabhJamadagni:

The suggestion from everyone I talked to was actually not to go the route of Core Data, but rather to add an SQLite instance to the app. Generally the sentiment was that Core Data is an easy Apple backed solution, but will cause problems later on. Looking at it, it seems that we could make use of the following options:

Here are some tutorials as well:

SaurabhJamadagni commented 1 year ago

The suggestion from everyone I talked to was actually not to go the route of Core Data, but rather to add an SQLite instance to the app.

That works too @andrewtavis. I was also interested in the alternatives as I used CoreData once and although powerful, it's showing a little age in my opinion. I could be wrong as there's so much more I didn't use the one time I gave it a shot. I'll go through the resources you have shared asap. Thanks for sharing these! The meetup seems to have had a very productive output I assume ๐Ÿ˜Š

Also I am back, so should get to the French PR soon :)

andrewtavis commented 1 year ago

Looking forward to the PR and learning about all this, @SaurabhJamadagni! The meetups are quite fun, yes :) Really nice people go to the Berlin ones ๐Ÿ˜Š

andrewtavis commented 1 year ago

Further feedback on from a coder from the iOS meetup here in Berlin is that the best option would be to go with SQLite.swift. I still haven't had the time to go through the tutorials for that, but will start looking into them when we're done with #208 and #248 and v2.1.0 is released ๐Ÿ˜Š

andrewtavis commented 1 year ago

@SaurabhJamadagni, I added a roadmap to the readme contributors section - just FYI. Would be great to get your feedback on that! I've been having an extended discussion about it in https://github.com/scribe-org/Scribe-Data/issues/14 (summary of that: using the Python emoji package doesn't make sense, we're going directly from Unicode sources for the words we want emoji suggestions for, and the goal is to get JSONs where we have word-Emoji pairs using the Unicode annotation JSON files).

Again, your feedback on the roadmap would be very very welcome ๐Ÿ˜Š

SaurabhJamadagni commented 1 year ago

I added a roadmap to the readme contributors section

The roadmap looks great @andrewtavis. I was wondering if adding the English keyboard should be done sooner, but it is best to keep it for a major version update. Although, is there a different place where we can move the bullet points for Scribe-Android and Scribe-Data instead of inside the Scribe-iOS repo? Just nitpicking, else the points look golden! ๐Ÿš€

I've been having an extended discussion

I tried to read through it. Can't say I understood everything ๐Ÿ˜… but I am assuming there are interoperability issues between going from Unicode to json for emojis? Maybe we could discuss this during our call.

andrewtavis commented 1 year ago

Although, is there a different place where we can move the bullet points for Scribe-Android and Scribe-Data instead of inside the Scribe-iOS repo?

The general idea now is to start using the new version of GitHub projects and just link to them that way :) Makes sense to me that it's the best way of organizing everything going forward, and that way we'd just link to the projects in order of their importance.

As far as a call, what day of the week would you be available? Wednesday or Thursday evening? I'm thinking I'm gonna get some designs for the app and #16 in tonight, so we can go through those a bit as well ๐Ÿ˜Š

SaurabhJamadagni commented 1 year ago

As far as a call, what day of the week would you be available? Wednesday or Thursday evening?

Thursday works @andrewtavis. What time should we have the call? Would 17:00 UTC do?

andrewtavis commented 1 year ago

17 UTC works, @SaurabhJamadagni ๐Ÿ˜Š Sending along an invite :)

andrewtavis commented 1 year ago

@austinate, @SaurabhJamadagni and I have been discussing how to implement SQLite.swift, and had a few questions ๐Ÿ™‚ Where we're at is:

andrewtavis commented 1 year ago

Changed the name of this issue so that it's more reflective of the current scope ๐Ÿ™ƒ

andrewtavis commented 1 year ago

As stated in #286, a lot of what was problematic in here was coming from the way the autosuggestions lexicon was being made :) All keyboards are loading much faster at this point except Russian โ€” the original problematic one because of its 194k noun dataset โ€” which is still loading slowly. On an emulator it is loading though, which before #286 was not the case.

This issue and #284 are now the last issues before we can finally do the v2.2.0 release with so many new additions ๐Ÿ˜Š๐Ÿš€

SaurabhJamadagni commented 1 year ago

This issue and https://github.com/scribe-org/Scribe-iOS/issues/284 are now the last issues before we can finally do the v2.2.0

AWESOME! ๐Ÿš€
What's the plan on this one @andrewtavis? How are we going to proceed?

andrewtavis commented 1 year ago

I'm gonna start looking at the documentation for SQLite.swift GRDB.swift and then try to wrap my head around it all, @SaurabhJamadagni. Really would be good to get this done as there's so much that could go out at this point ๐Ÿ˜Š

andrewtavis commented 1 year ago

I got into a bit of a groove over the last day and a half ๐Ÿ˜‡๐Ÿ˜„

andrewtavis commented 1 year ago

@SaurabhJamadagni, if you also want to look into SQLite.swift GRDB.swift that would be really great ๐Ÿ™๐Ÿ˜Š Then each of us can put any ideas that we get into this issue and we can then make a plan to switch it all over. There are tons of places where data is being referenced, so I think the big thing here is to maybe make the tables and load the data into them first, and then switch over the app table by table while checking that everything works (so first nouns, then verbs, etc).

Let me know how this sounds :) :)

andrewtavis commented 1 year ago

@SaurabhJamadagni, I edited the above links and changed them to GRDB.swift as I think I'd like to go with that instead of SQLite.swift. Reasons for this are that the original maintainer is still working on GRDB.swift, these two Reddit pages (link, link), its been benchmarked against other solutions and won (maybe biased, but at least they did benchmark it), the issues are really well maintained (like I thought oh it's not used so much cuz there are no issues and then I was like ๐Ÿ˜ฎ), and I generally really like the care that they put into all the docs that are right in the readme (I've already edited our readme a bit to borrow some of the stuff they do ๐Ÿ˜Š).

Hope this sounds good to you! :)

andrewtavis commented 1 year ago

Notes on this (edited):

andrewtavis commented 1 year ago

@wkyoshida, can I ask you to weigh in on this from the Scribe-Data side/your experience? Also given our discussion in https://github.com/scribe-org/Scribe-Data/issues/26, it'd be great if you'd share your opinions :)

A quick rundown for you:

Questions I'm thinking about now:

Your feedback and any general ideas would be very appreciated! ๐Ÿ˜Š

andrewtavis commented 1 year ago

Update on this :) sqlite3 and Python's json seem to be doing the trick ๐Ÿ˜Š I'm slowly but steadily creating a file in Scribe-Data that parses through JSONs and creates a unified language .sqlite database. Note that SQLite Viewer is really useful for checking the contents of the databases from within VS Code :)

Last thing is figuring out how to auto-generate the column names for verb tables based on what's in the JSON data, which should be pretty standard. Maybe will need minor changes in iOS if the naming conventions aren't consistent ๐Ÿ˜‡

Thing to do:

Edit: rest of issues moved to a lower comment :)

andrewtavis commented 1 year ago

077e346 adds the SQLite databases ๐Ÿ˜Š๐Ÿ˜Š๐Ÿ˜Š๐Ÿ˜Š๐Ÿ˜Š This SQLite stuff is great :) Wish I'd done this from the start! I'll work on Scribe-Data a bit more today and commit my changes later after I'm done switching over the process to not update the Data directories in Scribe-iOS, but rather local formatted_data directories in Scribe-Data.

We can now start experimenting with GRDB.swift and making references to the SQLite databases, which hopefully will go well ๐Ÿคž

andrewtavis commented 1 year ago

Note that for the databases, if there is an entry that doesn't exist the NAN value is an empty string (""). This means that some Scribe-iOS features like autosuggestions and emojis will need to be changed, as I think as of now they check to see if a value exists or not rather than if it's "".

SaurabhJamadagni commented 1 year ago

@SaurabhJamadagni, if you also want to look into SQLite.swift GRDB.swift that would be really great

Hey sorry @andrewtavis. Was a bit occupied with my college semester project. I'll read up too absolutely. I am sorry for the delay. You got so much done though that's insane! Have you pushed it to a remote branch that I can take a look at too?

andrewtavis commented 1 year ago

Hey @SaurabhJamadagni! No stress or delay at all :) As I said I just got into a bit of a flow on all this and went at it ๐Ÿ˜Š The databases are already in the main branch in each of the Keyboards/LanguageKeyboards directories, and the Scribe-Data stuff will be in there later :) Feel free to read up a bit and start testing out accessing the databases ๐Ÿ˜Š

SaurabhJamadagni commented 1 year ago

Taking a look at the sqllite files for the languages @andrewtavis. The data isn't referenced anywhere yet right? Looks so much more organized compared to the JSONs haha. So, you wrote the function that converts JSONs to sqlite? Is it in Scribe-Data?

So if I am understanding the transition correctly, instead of loading the arrays from JSON files we will query the database instead. But the arrays will still be formed? Or will we be performing smaller but specific queries wherever necessary and skip on creating the larger arrays completely?

For example, will data be loaded in the command variables like nouns or are we directly querying them inside the code where required?

andrewtavis commented 1 year ago

@SaurabhJamadagni, the data isnโ€™t referenced anywhere yet :) We can each try to see how using it goes before deciding on how exactly we want to use it. And yes the functions to convert the JSONs over are in my local Scribe-Data ๐Ÿ˜Š Will push them later.

Or will we be performing smaller but specific queries wherever necessary and skip on creating the larger arrays completely?

Itโ€™s a good question! I think if we can do specific queries that would be best, but letโ€™s decide on this later ๐Ÿ˜Š

SaurabhJamadagni commented 1 year ago

Will push them later.

No rush @andrewtavis, was just curious. Sounds good :)

andrewtavis commented 1 year ago

and remove SwiftyJSON from dependencies

Noting that this should not be done as odds are we'll be using JSON still for the app texts ๐Ÿค” We'll be adopting a localization platform in #268, which likely will be translating the texts housed in JSONs :) I'm not sure if there's any easy way to do this with another file type, so for now it's fine to keep the JSONs for this purpose ๐Ÿ™ƒ

andrewtavis commented 1 year ago

A reaction to this comment makes me feel even better about our choice of SQLite dependency ๐Ÿ˜Š๐Ÿ˜Š

https://github.com/scribe-org/Scribe-Data/commit/d830ae527640a3b23983db187c8509efc65e31a4 committed the local work I had been doing to create the SQLite databases for the language keyboards. There is a lot of code in there as there was so much refactoring... I updated the Scribe-Data changelog with the things that I did, but specifically the files of interest are:

There are still a few more things to be done and I'll move some stuff around now that some files are more used for extract-transform purposes rather than for loading. The final tests/fixes of it all will be when I redo the data process for #284 ๐Ÿš€

andrewtavis commented 1 year ago

e1069fb updates the data based on the recently closed https://github.com/scribe-org/Scribe-Data/issues/30 :) I was realizing that when I typed a plural noun there sometimes wasn't an emoji suggestion when for the singular version there was. Ex: I type "Krokodil" and I see the crocodile emoji, but typing "Krokodile" showed nothing, which I found weird. https://github.com/scribe-org/Scribe-Data/issues/30 looked for cases where a plural noun was not a key in the emoji keyword data, and when not the emojis from its singular form were used.

andrewtavis commented 1 year ago

As the comment with the to do list shows, I'm mostly done with the Scribe-Data work except for a final test during #284 ๐Ÿ”ฅ

We're ready to start switching over Scribe-iOS to the new databases ๐Ÿ˜Š๐Ÿ˜Š My thoughts on this would be trying to do direct queries where we can. We'd have strings like prefix for autosuggestions/completions or wordToTranslate and we can then try to do a direct query of the database for the information we want. If the data doesn't exist then we handle it by passing in case of auto actions or displaying an error in case of commands :)

I'd say it makes sense to start with prepositions as they're only for German and Russian. As soon as that's working we can start with some of the others, with translations being the next easiest as the data just has two columns. Translations from languages that aren't English are gonna be so much easier now that we can just make a database for it all ๐Ÿ™Œ

wkyoshida commented 1 year ago

Hey all! I'm catching up here on all the new work.. @andrewtavis was definitely on a roll :laughing:

Some thoughts:

emoji_keywords table in DELanguageData.sqlite word emoji_1 emoji_2 emoji_3
gesicht 34 66 71
emojis table in emojis.sqlite id emoji is_base
34 ๐Ÿ˜‚ false
66 ๐Ÿคฃ false
71 ๐Ÿ˜ญ false
andrewtavis commented 1 year ago

Hey @wkyoshida ๐Ÿ‘‹ Thanks for the feedback and the thoughts! ๐Ÿ˜Š

They are currently stored as word-translation (i.e. english-to-target language pairs) within each language's .sqlite. Instead, could putting all the languages together make sense? i.e. A separate .sqlite with a table with rows, english, german, french, and so on?

I'd say this would be once Wikidata gets to a point where we can get translations from there, right? That's when we can actually just get a single table. As of now we'll just be downloading as many words as we can and machine translating them, so I think source language based .sqlite files make sense where all the words we can get are primary keys, and all the translations from Hugging Face are the various columns via https://github.com/scribe-org/Scribe-Data/issues/23 as you mentioned.

One idea could be to have a separate .sqlite with a table with emoji metadata. Some rows could be id, emoji, is_base. Then, in the emoji_keywords tables of each language, instead of having the actual emojis, the tables could have the id.

This makes sense and seems to be where we'd go for #271 :)

Would the apps just be downloading the .sqlite files? Or as .jsons or something else? Based on the question below from @andrewtavis, I would guess .sqlite?

I'd also say .sqlite for now, but we can discuss this a bit later as you and I had also been discussing last_updated as a part of this so that the user's data update process could be much faster. Selective updating would also likely be easiest done via .sqlite, or would you suggest something else?

i.e. In thinking of Scribe-Data as a general-use package, would it make sense to have that as an option (i.e. the user can choose to either generate as .json or .sqlite)? If maybe not, then where should the JSON-to-SQLite conversion happen? On the app-side?

I'd say definitely as we extrapolate out we can keep the option for JSONs and .sqlite file exports ๐Ÿ˜Š Why not, if it's not too hard to maintain, as by the looks of it we're doing .sqlite for the app and Wikidata exports are JSONs or other formats that we wouldn't necessarily use in app, so data prep can happen while we're still in JSONs and eventually that or other file types can be exported? Scribe-Server side for sure though :) Apps should either be just getting a whole new updated .sqlite files from this point on, or update certain rows of it based on last_updated, but let me know if this sentence makes sense ๐Ÿ™

The data_to_sqlite.py works as an interim solution, but eventually, we can move that logic to Scribe-Server.

๐Ÿ‘๐Ÿ‘๐Ÿ‘๐Ÿ™Œ

Really awesome you got it done so quick! ๐Ÿ™Œ

Thank you!! ๐Ÿ˜Š๐Ÿ˜Š๐Ÿ™ Now to get this here app updated and shipped ๐Ÿš€๐Ÿ˜Š

wkyoshida commented 1 year ago

Hey! :wave:

I'd say this would be once Wikidata gets to a point where we can get translations from there, right?

Yeah, I'm thinking this could wait until we're able to do cross-translations - in whatever way this is accomplished in https://github.com/scribe-org/Scribe-Data/issues/23. I was more so pointing out perhaps that the current translations tables with the word-translation rows do limit us to only the English-to-target language translation, which is fine for now of course, since that is the only translation data that we have.

Selective updating would also likely be easiest done via .sqlite, or would you suggest something else?

This is what I want to make sure to think through. I think there will be three steps that are more time-heavy or resource-heavy that happen when a data download is done:

  1. Scribe-Server generates the data pack
  2. Scribe-Server sends the pack to the client that requested the pack
  3. The client unpacks/processes the pack into the client

For these steps, the file format should facilitate an efficient download workflow. FWIW I am currently thinking .sqlite though, because, for the aforementioned three steps:

  1. Getting the data needed based off of last_updated I believe will be faster if Scribe-Server is already storing the data in a DB
  2. If using .json, this would mean having to send multiple .json per language. On the other hand, there is one .sqlite per language.
    • However, if multiple languages are getting downloaded, this would still mean multiple .sqlite
  3. If using .json, the client would have to retain the ability to read the data as .json to then write it to .sqlite

I'd say definitely as we extrapolate out we can keep the option for JSONs and .sqlite file exports blush Why not, if it's not too hard to maintain

I guess the reservation that I'm having is more so coming from putting myself in the perspective of a user of Scribe-Data (as a general-use package). Like, would I care enough about the .sqlite option? Would I use it? If I don't, would I be bothered with the extra bloat of SQLite functionality that comes packaged in?

Maybe it's fine actually. I'm not definitely opposed. I'm perhaps more trying to think from the mindset that, as a user of packages, I would like my packages to do what I need them to, be small and compact, and have a clear use/purpose - if that makes sense.

Apps should either be just getting a whole new updated .sqlite files from this point on, or update certain rows of it based on last_updated, but let me know if this sentence makes sense :pray:

I'm with you! :+1:

andrewtavis commented 1 year ago

I was more so pointing out perhaps that the current translations tables with the word-translation rows do limit us to only the English-to-target language translation, which is fine for now of course, since that is the only translation data that we have.

I'm definitely of the opinion that as soon as more options are available for translation that instead of translation as a column we'd have a language name which is what the word would be translated into. I think we're on the same page ๐Ÿ˜Š What'll need to happen is that the relationship between translations will need to be inverted based on what we have right now. Right now we have within the German files a file that links English to German, but what will be there instead will be a file where word is German words and each of the columns is what we've gotten as translations. Right now we are saving translation files with their target, but in the future it'd be those words that are typed on the keyboard that's switched to for during translation :)

Glad to hear that it sounds like .sqlite can be a solid foundation for the future! Let's keep discussing and readjusting, but for now we're sailing ahead โ›ตโžก๏ธ

Maybe it's fine actually. I'm not definitely opposed. I'm perhaps more trying to think from the mindset that, as a user of packages, I would like my packages to do what I need them to, be small and compact, and have a clear use/purpose - if that makes sense.

I think it also depends on who the people are that first want to use it for non-Scribe purposes, but definitely happy to remove the bloat and even have a lot of the multi-use stuff be in Server for a wider community where Scribe-Data is more just what we need :)

I'm with you! ๐Ÿ‘

๐Ÿš€๐Ÿš€๐Ÿš€๐Ÿ˜Š