Closed abdul756 closed 2 months ago
Looks like one of the pipeline steps has introduced spurious UTF escaping of your characters. Easily fixable but likely to be an annoyance.
To help pinpoint the cause, could you replace the pw.io.csv.write line with pw.debug.compute_and_print(documents), look at your terminal output, and see if the problem persists?
Yeah I will try and update you
ﻣﻣﻠﻛﺔ\n\nﺗﻛون\n\nاﻟﺗﻲ\n\nاﻟدوﻟﯾﺔ\n\nواﻻﺗﻔﺎﻗﯾﺎت\n\nواﻟﻣﻌﺎھدات\n\nاﻷﻧظﻣﺔ\n\nﺑﮫ\n\nﺗﻘﺿﻲ\n\nﻣﺎ\n\nﻣراﻋﺎة\n\nﻣﻊ\n\nﻋﻠﯾﮭم.\n\nأو\n\nﻣﻧﮭم\n\nﻛﺎﻧت ﺳواءً\n\nﺗﻘﺎم\n\nاﻟﺗﻲ\n\nاﻟدﻋﺎوى\n\nﻓﻲ ،\n\nﺟﻧﺎﺋﯾﺔ\n\nﻏﯾر\n\nﻣﺎﻟﯾﺔ\n\nﻗﺿﺎﯾﺎ\n\nﻓﻲ\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nاﺳﺗﺣﻘﺎق\n\nوﻗت\n\nواﻟﻣوﻗوﻓون\n\nاﻟﻣﺳﺟوﻧون\n\n.1\n\nﻋﻣل.\n\nﻋﻘود\n\nﻋن\n\nاﻟﻧﺎﺷﺋﺔ\n\nﺑﻣﺳﺗﺣﻘﺎﺗﮭم\n\n؛ ﻟﻠﻣطﺎﻟﺑﺔ\n\nﻋﻧﮭم\n\nواﻟﻣﺳﺗﺣﻘون\n\nﻣﻧﮫ\n\nواﻟﻣﺳﺗﺛﻧون\n\nاﻟﻌﻣل\n\nﺑﻧظﺎم\n\nاﻟﻣﺷﻣوﻟون\n\nاﻟﻌﻣﺎل\n\n.2\n\nاﻟﺣﻛوﻣﯾﺔ.\n\nواﻷﺟﮭزة\n\nاﻟوزارات\n\n.3\n\nﺑذﻟك.\n\nاﻟﺧﺎﺻﺔ\n\nواﻟﻘواﻋد\n\nاﻹﺟراءات\n\nاﻟﻼﺋﺣﺔ\n\nوﺗﺣدد\n\nﻋﺷرة\n\nاﻟﺛﺎﻣﻧﺔ\n\nاﻟﻘﺿﺎﺋﯾﺔ.\n\nاﻟﺗﻛﺎﻟﯾف\n\nﺑدﻓﻊ ﻋﻠﯾﮫ\n\nاﻟﻣﺣﻛوم\n\nﻓﯾﻠزم\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nﻣن\n\nاﻟﻣُﻌﻔﻰ\n\nﻟﻣﺻﻠﺣﺔ\n\nاﻟدﻋوى\n\nﻓﻲ ﺣﻛم\n\nﺻدر\n\nإذا\n\n(،\n\nﻋﺷرة\n\nاﻟﺳﺎﺑﻌﺔ\n\nاﻟﻣﺎدة )\n\nﺑﮫ\n\nﺗﻘﺿﻲ\n\nﻣﺎ\n\nﻣراﻋﺎة\n\nﻣﻊ\n\nاﻟرد.\n\nﻣﺳوﻏﺎت\n\nﺗواﻓرت\n\nإذا\n\nِھﺎ\n\nوردّ\n\nﻋﻠﯾﮫ.\n\nواﻹﺷراف\n\nﻋﻣﻠﮫ\n\nاﻟﻘﺿﺎﺋﯾﺔ ،\n\nإﺟراءات\n\nﻋﺷرة\n\nاﻟﺗﺎﺳﻌﺔ\n\nاﻟﺳﻌودي.\n\nاﻟﻣرﻛزي\n\nاﻟﺑﻧك\n\nﻟدى\n\nاﻟﻣﺎﻟﯾﺔ\n\nوزارة\n\nﺟﺎري\n\nﺣﺳﺎب ﻓﻲ\n\nاﻟﻣﺣﺻﻠﺔ\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nﻣﺑﺎﻟﻎ\n\nﺗودع\n\nاﻟﻌﺷرون\n\nاﻟﺗﻛﺎﻟﯾف\n\nﺑﺗﺣﺻﯾل\n\nاﻟطﻠب -\n\nإﻟﯾﮭﺎ\n\nاﻟﻣﻘدم\n\nأو ،\n\nاﻟدﻋوى\n\nإﻟﯾﮭﺎ\n\nاﻟﻣرﻓوع\n\nاﻟﻣﺣﻛﻣﺔ\n\nﻓﻲ -\n\nاﻟﻣﺧﺗﺻﺔ\n\nاﻹدارة\n\nﻣﻧﮫ\n\nﺑﻘرار\n\nاﻟﻌدل\n\nوزﯾر\n\nﯾﺣدد\n\nواﻟﻌﺷرون\n\nاﻟﺣﺎدﯾﺔ\n\nوﻗواﻋد\n\nﻟﮫ\n\nاﻟﺗراﺧﯾص\n\nأﺣﻛﺎم\n\nاﻟﻼﺋﺣﺔ\n\nوﺗﺣدد\n\nاﻟﻧظﺎم.\n\nﻟﺗطﺑﯾﻖ\n\nاﻟﻣﺳﺎﻧدة\n\nﺑﺎﻷﻋﻣﺎل\n\nﺑﺎﻟﻘﯾﺎم\n\nاﻟﺧﺎص\n\nﻟﻠﻘطﺎع\n\nاﻟﺗرﺧﯾص\n\nاﻟﻌدل\n\nﻟوزﯾر\n\nواﻟﻌﺷرون\n\nاﻟﺛﺎﻧﯾﺔ\n\nاﻟوزراء.\n\nﻣﺟﻠس ﻣن\n\nﺑﻘرار\n\nوﺗﺻدر\n\nاﻟﻧظﺎم ،\n\nﺻدور\n\nﺗﺎرﯾﺦ\n\nﻣن\n\nﯾوﻣﺎً\n\nﺳﺗﯾن (\n\nﺧﻼل )\n\nاﻟﻼﺋﺣﺔ\n\nاﻟﻌدل\n\nوزارة\n\nﺗﻌد\n\nواﻟﻌﺷرون\n\nاﻟﺛﺎﻟﺛﺔ\n\nاﻟرﺳﻣﯾﺔ.\n\nاﻟﺟرﯾدة\n\nﻓﻲ\n\nﻧﺷره\n\nﺗﺎرﯾﺦ\n\nﻣن\n\nﯾوﻣﺎً\n\nوﺛﻣﺎﻧﯾن (\n\nﻣﺎﺋﺔ\n\nﺑﻌد )\n\nﺑﺎﻟﻧظﺎم\n\nﯾﻌﻣل\n\n2021 2021\n\nاﻟﻌﺪل اﻟﻌﺪل\n\nﻟﻮزارة ﻟﻮزارة\n\n© ©\n\nﻣﺤﻔﻮﻇﺔ ﻣﺤﻔﻮﻇﺔ\n\nاﻟﺤﻘﻮق اﻟﺤﻘﻮق\n\nﺟﻤﻴﻊ ﺟﻤﻴﻊ', pw.Json({'filetype': 'application/pdf', 'languages': ['eng'], 'links': [], 'page_number': 4})),)
When I set the mode to static and used pw.debug.compute_and_print(documents), the terminal output shows the Arabic text correctly. So how do I fix this in streaming mode when using pw.io.csv.write?
Hey @abdul756, streaming mode is only usable when you are running the app with pw.run().
In this case, it seems the parser is working fine. If you want to dump the text into a file and keep the app running (so that when new content or a new file arrives, the new data is added to your csv file), run your code in streaming mode by adding pw.run() at the end of the code.
So it will look like this:
documents = folder.select(text=parser(pw.this.data))
pw.io.csv.write(documents, "output_stream_en_7.csv")
pw.run()
You can run this in a notebook or in a regular Python file.
This will start the pipeline, which keeps running until you stop it. After calling pw.run(), you will see the output file being created.
If you are interested in taking a one-time dump in a static manner, you can run the following:
df = pw.debug.table_to_pandas(documents)
df.to_csv("outputs_en.csv")
This will put the content into a csv file. In this case, we load the data into a Pandas DataFrame and then write it to a file. This one doesn't need pw.run(), since it runs statically, a single time.
In streaming mode the Arabic output is not coming through; it gives only Unicode escape characters. You can refer to the file I attached to the issue.
Yes, I just replicated the issue with another file. Static mode works fine (refer to the df.to_csv snippet above), and internally the Pathway table also stores the data correctly as Arabic characters; however, writing them with the csv writer turns them into Unicode escapes.
So you can use it in your app without any issues. In the meantime, we are investigating and will keep you updated.
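As a stopgap for files that were already written with escapes, the text can be decoded back after the fact. The following is a minimal sketch using only the Python standard library. It assumes the escapes have the JSON-style \uXXXX form, which this thread does not confirm, so check the actual file first. The helper name unescape_unicode is hypothetical and is not part of Pathway.

```python
import json
import re

# One or more consecutive \uXXXX escapes.
# Assumption: the escaped output uses this JSON-style form.
_ESCAPE_RUN = re.compile(r"(?:\\u[0-9a-fA-F]{4})+")


def unescape_unicode(text: str) -> str:
    """Replace runs of \\uXXXX escapes with the characters they encode."""

    def _decode(match: re.Match) -> str:
        # json.loads understands \uXXXX escapes, including surrogate pairs.
        return json.loads('"' + match.group(0) + '"')

    return _ESCAPE_RUN.sub(_decode, text)


# Illustration only: json.dumps with ensure_ascii=True is one common way
# such escapes are produced. This is not a claim about Pathway's internals.
escaped = json.dumps("نظام", ensure_ascii=True).strip('"')
restored = unescape_unicode(escaped)
```

Text that contains no escapes passes through unchanged, so the helper can be applied to every cell of an affected csv file.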
Sure thanks
@abdul756 thanks for reporting this. The problem will be fixed in the next release (it'll be released this week).
Hey @abdul756, Pathway v0.11.0 was released today. Your problem should be fixed now. However, nobody on the team knows Arabic. Could you update your Pathway version and confirm that it solves your problem satisfactorily?
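One way to check the new output without reading Arabic is to count escape sequences and Arabic code points in the written file. This is a rough sketch, not an official test: summarize_csv_text is a hypothetical helper, the file is assumed to be UTF-8, and the escape form is assumed to be \uXXXX.

```python
import re
from pathlib import Path

# Literal \uXXXX sequences left in the file (assumed escape form).
ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")
# Arabic block plus Arabic Presentation Forms-A and -B, which PDF
# extraction often emits instead of the base letters.
ARABIC = re.compile(r"[\u0600-\u06FF\uFB50-\uFDFF\uFE70-\uFEFC]")


def summarize_csv_text(path: str) -> dict:
    """Count escape sequences and Arabic characters in a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return {
        "escapes": len(ESCAPE.findall(text)),
        "arabic_chars": len(ARABIC.findall(text)),
    }


# Usage (file name taken from the thread):
# print(summarize_csv_text("output_stream_ar_7.csv"))
```

A file written correctly should report zero escapes and a non-zero count of Arabic characters. The reverse pattern suggests the escaping is still present.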
I will test and update
Closing, as the problem was fixed and there was no final user feedback. @abdul756, feel free to reopen if something is wrong with the Arabic text representation.
The parser is giving Unicode escape characters for Arabic text. How can I parse files in languages other than English? I am attaching the output.
Code
نظام التكاليف القضائية.pdf
output_stream_ar_7.csv
Please help me resolve this issue.
I would request help; I would like to see an example; I would like to understand the cause of the issue.