pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com
Other
4.34k stars 139 forks

Parser giving unicode characters for Arabic language #47

Closed abdul756 closed 2 months ago

abdul756 commented 6 months ago

The parser is giving Unicode characters for Arabic. How do I parse files in languages other than English? I am attaching the output.

Code

import pathway as pw

# ParseUnstructured comes from Pathway's LLM xpack; data_dir is the folder holding the PDFs.
files = pw.io.fs.read(
    data_dir,
    mode="streaming",
    format="binary",
    autocommit_duration_ms=50,
)
parser = ParseUnstructured()
documents = files.select(text=parser(pw.this.data))
pw.io.csv.write(documents, "output_stream_en_7.csv")

نظام التكاليف القضائية.pdf

output_stream_ar_7.csv

Please help in resolving this issue.

I would request help; I would like to see an example; I would like to understand the cause of the issue.

dxtrous commented 6 months ago

Looks like one of the pipeline steps has introduced spurious UTF escaping of your characters. Easily fixable but likely to be an annoyance.

To help pinpoint the cause, could you replace the pw.io.csv.write line with pw.debug.compute_and_print(documents), look at your terminal output, and see if the problem persists?
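For readers unfamiliar with this kind of escaping, here is a minimal sketch of how it can arise in general. It uses Python's standard json module purely as an illustration; it is an assumption that the escaping in this issue looks similar, and the thread does not say which pipeline step was responsible.

```python
import json

# Sample Arabic string (the title of the attached PDF, without the extension).
text = "نظام التكاليف القضائية"

# By default, json.dumps escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(text)

# With ensure_ascii=False the characters are written as they are.
readable = json.dumps(text, ensure_ascii=False)

print(escaped)   # ASCII-only output made of \uXXXX sequences
print(readable)  # Arabic characters preserved

# Both forms decode to the same string; only the serialized representation differs.
assert json.loads(escaped) == json.loads(readable) == text
```

The data is not lost in either case, which is why escaping of this kind tends to be an annoyance rather than corruption.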

abdul756 commented 6 months ago

Yeah, I will try it and update you.

abdul756 commented 6 months ago

ﻣﻣﻠﻛﺔ\n\nﺗﻛون\n\nاﻟﺗﻲ\n\nاﻟدوﻟﯾﺔ\n\nواﻻﺗﻔﺎﻗﯾﺎت\n\nواﻟﻣﻌﺎھدات\n\nاﻷﻧظﻣﺔ\n\nﺑﮫ\n\nﺗﻘﺿﻲ\n\nﻣﺎ\n\nﻣراﻋﺎة\n\nﻣﻊ\n\nﻋﻠﯾﮭم.\n\nأو\n\nﻣﻧﮭم\n\nﻛﺎﻧت ﺳواءً\n\nﺗﻘﺎم\n\nاﻟﺗﻲ\n\nاﻟدﻋﺎوى\n\nﻓﻲ ،\n\nﺟﻧﺎﺋﯾﺔ\n\nﻏﯾر\n\nﻣﺎﻟﯾﺔ\n\nﻗﺿﺎﯾﺎ\n\nﻓﻲ\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nاﺳﺗﺣﻘﺎق\n\nوﻗت\n\nواﻟﻣوﻗوﻓون\n\nاﻟﻣﺳﺟوﻧون\n\n.1\n\nﻋﻣل.\n\nﻋﻘود\n\nﻋن\n\nاﻟﻧﺎﺷﺋﺔ\n\nﺑﻣﺳﺗﺣﻘﺎﺗﮭم\n\n؛ ﻟﻠﻣطﺎﻟﺑﺔ\n\nﻋﻧﮭم\n\nواﻟﻣﺳﺗﺣﻘون\n\nﻣﻧﮫ\n\nواﻟﻣﺳﺗﺛﻧون\n\nاﻟﻌﻣل\n\nﺑﻧظﺎم\n\nاﻟﻣﺷﻣوﻟون\n\nاﻟﻌﻣﺎل\n\n.2\n\nاﻟﺣﻛوﻣﯾﺔ.\n\nواﻷﺟﮭزة\n\nاﻟوزارات\n\n.3\n\nﺑذﻟك.\n\nاﻟﺧﺎﺻﺔ\n\nواﻟﻘواﻋد\n\nاﻹﺟراءات\n\nاﻟﻼﺋﺣﺔ\n\nوﺗﺣدد\n\nﻋﺷرة\n\nاﻟﺛﺎﻣﻧﺔ\n\nاﻟﻘﺿﺎﺋﯾﺔ.\n\nاﻟﺗﻛﺎﻟﯾف\n\nﺑدﻓﻊ ﻋﻠﯾﮫ\n\nاﻟﻣﺣﻛوم\n\nﻓﯾﻠزم\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nﻣن\n\nاﻟﻣُﻌﻔﻰ\n\nﻟﻣﺻﻠﺣﺔ\n\nاﻟدﻋوى\n\nﻓﻲ ﺣﻛم\n\nﺻدر\n\nإذا\n\n(،\n\nﻋﺷرة\n\nاﻟﺳﺎﺑﻌﺔ\n\nاﻟﻣﺎدة )\n\nﺑﮫ\n\nﺗﻘﺿﻲ\n\nﻣﺎ\n\nﻣراﻋﺎة\n\nﻣﻊ\n\nاﻟرد.\n\nﻣﺳوﻏﺎت\n\nﺗواﻓرت\n\nإذا\n\nِھﺎ\n\nوردّ\n\nﻋﻠﯾﮫ.\n\nواﻹﺷراف\n\nﻋﻣﻠﮫ\n\nاﻟﻘﺿﺎﺋﯾﺔ ،\n\nإﺟراءات\n\nﻋﺷرة\n\nاﻟﺗﺎﺳﻌﺔ\n\nاﻟﺳﻌودي.\n\nاﻟﻣرﻛزي\n\nاﻟﺑﻧك\n\nﻟدى\n\nاﻟﻣﺎﻟﯾﺔ\n\nوزارة\n\nﺟﺎري\n\nﺣﺳﺎب ﻓﻲ\n\nاﻟﻣﺣﺻﻠﺔ\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nﻣﺑﺎﻟﻎ\n\nﺗودع\n\nاﻟﻌﺷرون\n\nاﻟﺗﻛﺎﻟﯾف\n\nﺑﺗﺣﺻﯾل\n\nاﻟطﻠب -\n\nإﻟﯾﮭﺎ\n\nاﻟﻣﻘدم\n\nأو ،\n\nاﻟدﻋوى\n\nإﻟﯾﮭﺎ\n\nاﻟﻣرﻓوع\n\nاﻟﻣﺣﻛﻣﺔ\n\nﻓﻲ -\n\nاﻟﻣﺧﺗﺻﺔ\n\nاﻹدارة\n\nﻣﻧﮫ\n\nﺑﻘرار\n\nاﻟﻌدل\n\nوزﯾر\n\nﯾﺣدد\n\nواﻟﻌﺷرون\n\nاﻟﺣﺎدﯾﺔ\n\nوﻗواﻋد\n\nﻟﮫ\n\nاﻟﺗراﺧﯾص\n\nأﺣﻛﺎم\n\nاﻟﻼﺋﺣﺔ\n\nوﺗﺣدد\n\nاﻟﻧظﺎم.\n\nﻟﺗطﺑﯾﻖ\n\nاﻟﻣﺳﺎﻧدة\n\nﺑﺎﻷﻋﻣﺎل\n\nﺑﺎﻟﻘﯾﺎم\n\nاﻟﺧﺎص\n\nﻟﻠﻘطﺎع\n\nاﻟﺗرﺧﯾص\n\nاﻟﻌدل\n\nﻟوزﯾر\n\nواﻟﻌﺷرون\n\nاﻟﺛﺎﻧﯾﺔ\n\nاﻟوزراء.\n\nﻣﺟﻠس ﻣن\n\nﺑﻘرار\n\nوﺗﺻدر\n\nاﻟﻧظﺎم ،\n\nﺻدور\n\nﺗﺎرﯾﺦ\n\nﻣن\n\nﯾوﻣﺎً\n\nﺳﺗﯾن (\n\nﺧﻼل )\n\nاﻟﻼﺋﺣﺔ\n\nاﻟﻌدل\n\nوزارة\n\nﺗﻌد\n\nواﻟﻌﺷرون\n\nاﻟﺛﺎﻟﺛﺔ\n\nاﻟرﺳﻣﯾﺔ.\n\nاﻟﺟرﯾدة\n\nﻓﻲ\n\nﻧﺷره\n\nﺗﺎرﯾﺦ\n\nﻣن\n\nﯾوﻣﺎً\n\nوﺛﻣﺎﻧﯾن (\n\nﻣﺎﺋﺔ\n\nﺑﻌد )\n\nﺑﺎﻟﻧظﺎم\n\nﯾﻌﻣل\n\n2021 2021\n\nاﻟﻌﺪل اﻟﻌﺪل\n\nﻟﻮزارة ﻟﻮزارة\n\n© ©\n\nﻣﺤﻔﻮﻇﺔ ﻣﺤﻔﻮﻇﺔ\n\nاﻟﺤﻘﻮق اﻟﺤﻘﻮق\n\nﺟﻤﻴﻊ ﺟﻤﻴﻊ', pw.Json({'filetype': 'application/pdf', 'languages': ['eng'], 'links': [], 'page_number': 4})),)

When I set the mode to static and used pw.debug.compute_and_print(documents), the terminal output shows the Arabic text correctly. So how can I fix this in streaming mode when using pw.io.csv.write?

berkecanrizai commented 6 months ago

Hey @abdul756, streaming mode is only usable when you are running the app with pw.run().

In this case, the parser seems to be working fine. If you want to dump the text into a file and keep the app running (so that when new content or a new file arrives, the new data is written to your CSV file), keep streaming mode enabled and add pw.run() at the end of the code.

It will look like this:

documents = files.select(text=parser(pw.this.data))
pw.io.csv.write(documents, "output_stream_en_7.csv")

pw.run()

You can run this in a notebook or in a regular Python file.

This will start the pipeline, which keeps running until you stop it. After calling pw.run(), you will see the output file being created.

If you are interested in taking a one-time dump in a static manner, you can run the following:

df = pw.debug.table_to_pandas(documents)
df.to_csv("outputs_en.csv")

This writes the content to a CSV file: the data is first loaded into a pandas DataFrame and then written to a file. It doesn't need pw.run(), since it runs statically, a single time.

abdul756 commented 6 months ago

In streaming mode the output is not coming through correctly; it gives only Unicode characters. You can refer to the file I attached to the issue.

berkecanrizai commented 6 months ago

Yes, I just replicated the issue with another file. Static mode works fine (refer to the df.to_csv snippet above), and internally the Pathway table also stores the data correctly in Arabic characters. However, writing them with the CSV writer turns them into Unicode escape sequences. So you can use the data in your app without any issues. In the meantime, we are investigating and will keep you updated.
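Until a fixed version is available, an already written file could in principle be post-processed. The sketch below assumes the CSV contains literal backslash-u sequences of the form \uXXXX; the thread does not show the exact file contents, so treat that as an assumption. The helper name unescape_unicode is hypothetical and not part of Pathway.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(s: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they encode."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


# Raw string: these are literal backslashes, as they would appear in the file.
sample = r"\u0646\u0638\u0627\u0645"
print(unescape_unicode(sample))
```

This simple version does not combine surrogate pairs and would also convert a sequence that was intentionally written as a backslash followed by u and four hex digits. Arabic letters sit in the Basic Multilingual Plane, so surrogate pairs should not occur for them.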

abdul756 commented 6 months ago

Sure thanks

KamilPiechowiak commented 6 months ago

@abdul756 thanks for reporting this. The problem will be fixed in the next release (it'll be released this week).

KamilPiechowiak commented 6 months ago

Hey @abdul756, Pathway v0.11.0 was released today. Your problem should be fixed now. However, nobody on the team knows Arabic. Could you update your Pathway version and confirm that this is a satisfactory solution to your problem?
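After upgrading (for example with pip install -U pathway), a quick way to confirm that the installed version is at least the release mentioned above is sketched below. It uses only the standard library; the helper names parse_version and has_fix are hypothetical, and the comparison looks only at the first three numeric components.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (0, 11, 0)  # release named in the comment above


def parse_version(v: str) -> tuple:
    """Turn '0.11.0' into (0, 11, 0), padding missing components with zeros."""
    nums = [int(n) for n in re.findall(r"\d+", v)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


def has_fix(installed: str) -> bool:
    """True if the given version string is at least FIXED_IN."""
    return parse_version(installed) >= FIXED_IN


try:
    installed = version("pathway")
    status = "includes the fix" if has_fix(installed) else "predates the fix"
    print(f"pathway {installed} {status}")
except PackageNotFoundError:
    print("pathway is not installed in this environment")
```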

abdul756 commented 6 months ago

I will test it and update you.

KamilPiechowiak commented 2 months ago

Closing, as the problem was fixed and there was no final user feedback. @abdul756, feel free to reopen if something is wrong with the Arabic language representation.