open-automation / switch-portals

Route jobs in a Switch flow without a direct connector.
MIT License

Unarchiving of the job to the temporary folder was unsuccessful. #26

p-storey opened this issue 5 years ago

p-storey commented 5 years ago

Hi,

I'm having the following problem with some of my portals: Unarchiving of the job to the temporary folder was unsuccessful.

The file that caused this issue was 4 GB; I have seen larger files process without error.

Invalid XML document: C:/Users//AppData/Roaming/Enfocus/Switch Server/temp/275/ScriptElement/1752/365/unarchiveTemp0ARHA\jobTicket\ticket.xml

Error in line 137 of script : Error. File 'C:/Users//AppData/Roaming/Enfocus/Switch Server/temp/275/ScriptElement/1752/365/unarchiveTemp0ARHA\jobTicket\ticket.xml' does not exist

I did see that a similar issue was apparently closed a while ago.

Thanks

Pete

joyd42 commented 5 years ago

I have the same error with a customer now.

The error happens because there is a zip in the ether folder that contains no ticket.xml file. Line 137, which is mentioned in the error, is where the script tries to read this nonexistent file. I had to manually remove the job from the ether folder to get the flows running again.

If I check my messages for the history of that job, I also see that the outgoing portal output this job earlier with suffixes which suggest the job was not fully written at that point. Filenames like these:

Intermarche_Bache-190x490_HighRes.pdf.5dreq01.partial
strategie-3355f5.zip.e6ffwtf.partial

The file that triggered the issue here is 500 MB, so it does seem to be related to big jobs.

Dominick made one GitHub contribution in the last year. He seems more active on the Enfocus forums, but there he also mentions that he hasn't used Switch in a while: [https://forum.enfocus.com/viewtopic.php?f=13&t=2722&p=9022#p9022]

So to me it looks like he is no longer maintaining these Apps since he is no longer working with Switch.

@dominickp, am I correct in this assumption?

For now I think I'll move back to the old system of interconnecting folders and keeping the unique prefixes instead of portals.

Maybe one day if I have sufficient time I'll create a similar project with a different mechanism to know if files are already fully written.

DeschampsThomasSobook commented 5 years ago

OK, same problem here, but I've also noticed that this failure is also triggered when a directory path is long enough to exceed the Windows maximum path length.

@joyd42, there is a check built into Switch: hasArrived( fileName : String, stabilitySeconds : Number ) : Boolean [static] https://www.enfocus.com/manuals/DeveloperGuide/SW/19/home.html#en-us/common/sw/reference/r_file_class.html#checking_for_arrival
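
Roughly what I have in mind for the pickup side (untested sketch; File.hasArrived is the documented call above, while tryPickup, processZip and the 2-second stability value are only illustrative):

```javascript
// Untested sketch: gate the pickup on File.hasArrived before touching the zip.
// File.hasArrived( fileName, stabilitySeconds ) is from the Switch scripting
// reference linked above; tryPickup and processZip are illustrative names.
function tryPickup( s, jobZipPath )
{
    // Only continue once the zip's size has been stable for 2 seconds.
    if ( !File.hasArrived( jobZipPath, 2 ) )
    {
        // Still being written: leave it for the next timerFired pass.
        return;
    }
    processZip( s, jobZipPath ); // placeholder for the existing unarchive/ticket logic
}
```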

I'm reviewing the code of both the outgoing and incoming scripts, and will try a fix in a few days.

joyd42 commented 5 years ago

It would be really nice to have a fixed version of the Portals App, I haven't had the time to write and test a good fix myself.

Using hasArrived will most likely avoid the issue; however, it will add a delay in processing caused by the time it waits to be 100% sure the file is fully written.

I think an alternative could be:

1. Write a lock file in the ether folder before moving the job file, datasets and metadata file there.
2. Move the job file, datasets and metadata file to the ether folder. There is no need to zip them, since zipping causes an additional delay.
3. Remove the lock file again once the files are fully written.
4. In the portal which is picking up jobs, check in timerFired for all jobs that have no lock file next to them and bring them into the flow.

This way the delay between the incoming and outgoing portal will be: time to move jobs to the ether folder + a part of the timerFired interval + time to move jobs to the next flow.

Using hasArrived, the delay will be: time to move jobs to the ether folder + a part of the timerFired interval + the time that hasArrived waits + time to move jobs to the next flow.
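
A rough sketch of that lock-file idea (illustrative JavaScript-style pseudocode; writeEmptyFile, moveFilesTo, listJobs and the other helpers are placeholder names, not the real script's functions):

```javascript
// Rough sketch of the lock-file idea, illustrative names only.

// Sending side: write the lock first, move the files, then remove the lock.
function sendToEther( etherFolder, jobName, jobFiles )
{
    var lockPath = etherFolder + "/" + jobName + ".lock";
    writeEmptyFile( lockPath );              // 1. lock appears before any data
    moveFilesTo( etherFolder, jobFiles );    // 2. job file, datasets, metadata (no zip)
    removeFile( lockPath );                  // 3. lock disappears only when everything is written
}

// Receiving side, called from timerFired: only take jobs whose lock is gone.
function pickupFromEther( etherFolder )
{
    var jobs = listJobs( etherFolder );
    for ( var i = 0; i < jobs.length; i++ )
    {
        var lockPath = etherFolder + "/" + jobs[i].name + ".lock";
        if ( !fileExists( lockPath ) )
        {
            bringIntoFlow( jobs[i] );        // 4. safe to inject, nothing is still being written
        }
    }
}
```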

You could even avoid the timerFired interval delay by using webhooks as a signal between portal apps when jobs are ready to be picked up: the incoming portal sends an HTTP request when a job is ready in the ether folder, and the outgoing portal receives the HTTP request as a webhook and knows it can start to process the file. The downside of this method is that the webhooks port can be changed in the Switch preferences, so you would either have to assume it is not changed from the default, or add a setting in the portal app where users have to fill in the webhook port.

While my suggestions have merit, they also require bigger code changes than your suggestion. So please feel free to write your script just the way you want it :)

DeschampsThomasSobook commented 5 years ago

Thanks for the feedback @joyd42. hasArrived has a typical wait of 1 second, which is not much compared to the 20 s from timerFired. And it's a lot easier to implement!

After discussing the topic with a colleague, I'll also add a try/catch when reading the XML in order to skip to the next job if a failure happens.
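
Something in this direction (sketch only; readTicket and handleJob are stand-ins for how the script actually parses ticket.xml and routes the job, and I'm assuming s.log with level 3 writes an error message):

```javascript
// Sketch: wrap the ticket.xml read so one bad job does not stop the whole pass.
// readTicket and handleJob are stand-ins for the real script's functions.
function processJobs( s, ticketPaths )
{
    for ( var i = 0; i < ticketPaths.length; i++ )
    {
        try
        {
            var ticket = readTicket( ticketPaths[i] ); // may throw on a missing or invalid XML
            handleJob( s, ticket );
        }
        catch ( error )
        {
            // Log and move on to the next job instead of failing the whole timerFired pass.
            s.log( 3, "Skipping job, could not read ticket: " + ticketPaths[i] );
        }
    }
}
```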

joyd42 commented 5 years ago

If you go the hasArrived route, then I think it is best to test with really big files and see whether the file size of the zip increases continuously.

I haven't checked the behaviour of the Switch zip function, but I've seen other software where writing a big file doesn't mean that the file size increases continuously. There can be pauses in the growth of the file size that last longer than a second.
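
For example, a quick standalone Node.js watcher (run outside Switch, purely to observe how the zip grows while the outgoing side is still writing it; the 1-second threshold matches the stability window discussed above):

```javascript
// Standalone Node.js watcher: point it at the zip being written in the ether
// folder and it reports whenever the file size stalls for more than a second,
// which would defeat a short hasArrived stability window.
const fs = require('fs');

const watchedFile = process.argv[2]; // path to the zip being written
let lastSize = -1;
let stableSince = Date.now();

setInterval(() => {
  let size;
  try {
    size = fs.statSync(watchedFile).size;
  } catch (err) {
    return; // file not there yet
  }
  if (size !== lastSize) {
    lastSize = size;
    stableSince = Date.now();
  } else {
    const stalledMs = Date.now() - stableSince;
    if (stalledMs > 1000) {
      console.log(`size stalled at ${size} bytes for ${stalledMs} ms`);
    }
  }
}, 250);
```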

DeschampsThomasSobook commented 5 years ago

Hi, I managed to trigger the bug even with small files (by setting timerFired to 1 s and sending a lot of copies of the same dummy file). So it really does seem to be tied to the "outgoing" script trying to access a file that is not fully written at that moment.

You'll find attached a new version with some minor tweaks:

At the moment, if an "old" zipped problem file is still present, it will trigger the same error (at line 149 now). I'm working on a check to verify that the zip is OK and, if it isn't, skip the file with an error logged.
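
Something like this for the guard (sketch; I'm assuming File.exists sits in the same File class as hasArrived, and logAndSkip stands in for whatever we end up doing with the bad archive):

```javascript
// Sketch of the guard: check the ticket right after unarchiving and skip or
// flag the job instead of throwing. File.exists is assumed to be available in
// the same File class as hasArrived; logAndSkip is an illustrative name.
function checkUnarchivedJob( s, unarchiveTempFolder )
{
    var ticketPath = unarchiveTempFolder + "/jobTicket/ticket.xml";
    if ( !File.exists( ticketPath ) )
    {
        // Bad or truncated zip: report it and move on instead of crashing at line 149.
        s.log( 3, "No ticket.xml found in " + unarchiveTempFolder + ", skipping this archive" );
        logAndSkip( s, unarchiveTempFolder ); // placeholder: move the zip aside / error out the job
        return false;
    }
    return true;
}
```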

Outgoing_V4.zip


DeschampsThomasSobook commented 5 years ago

By the way, feel free to test this one and give your feedback.

@dominickp can you push this update to the Enfocus Store if the tests are successful? Or do you mind if I contact Enfocus to get it updated?

Dominick-Peluso-Bose commented 5 years ago

@DeschampsThomasSobook this is fantastic. I'm so glad that someone is interested in further developing Portals since I've lost my access to Switch.

Do you mind making a pull request for these changes? That way, we can all review the diff and merge it into the code easily if we are happy with the test results. If you are unfamiliar:

joyd42 commented 5 years ago

I work for an Enfocus reseller, so I'm afraid I don't have a production Switch of my own where I could test this. My two customers where I used the Portals apps only hit the issue once every 3-4 weeks or so, so a test to see if it still works in their environment would take too long. Thanks for all your work on this!

DeschampsThomasSobook commented 5 years ago

@Dominick-Peluso-Bose I'm not yet familiar with Git, I'll try :)

@joyd42 You could test this by:

- Creating a special flow with the "dummy job clock" element generating a new file every few seconds
- Injecting a "big" file (I tested with 500 MB)
- Using the portal with the new outgoing setting property: scan for new files every 1 s

I tested this method this week and processed ~10k files without error.

DeschampsThomasSobook commented 5 years ago

@Dominick-Peluso-Bose OK, I think I used the fork correctly :)

joyd42 commented 5 years ago

Ok, I have a test running. I'll let you know the result.

One minor point of feedback: there is a typo in the tooltip of the new "Process file only after" setting. "vebore" most likely needs to be "before" :)

joyd42 commented 5 years ago

In my test I injected the PitStop Server installer, which is slightly over 1 GB. I started the test when I left the office, and apparently my Switch shut down after half an hour because there was not enough disk space. The reason the disk filled up is that the ether folder grew to 60 GB. @DeschampsThomasSobook, is it possible that the cleanup mechanism for the ether folder is not working correctly yet?

joyd42 commented 5 years ago

In case you want to check them, here are my logs (I removed stuff from other flows) and my exported flow. 20190903170303.xlsx

New flow.sflow.zip

DeschampsThomasSobook commented 5 years ago

Oops, bad typo, not my native language :) I will fix this soon.

I tested your flow on my Switch (Windows) without error; the cleanup of the ether folder works fine. Maybe it's tied to the clock setting: it sends 2 files every 10 s (the txt and the injected one) and the flow takes 5 s for each to check hasArrived.

That said, it fixes the initial error but takes way too long for a large number of files. I use portals extensively, ~20k times per day (700 files/day multiplied by at least 30 portal jumps), and with a 5 s check that represents a lot of downtime.

I'll explore your other idea (the lock file) by the end of the week if I have time.

Dominick-Peluso-Bose commented 5 years ago

@DeschampsThomasSobook I created a PR from your fork here https://github.com/open-automation/switch-portals/pull/28

Changes you make in your fork will show up here and eventually they can be merged into this project once tested and approved.

DeschampsThomasSobook commented 5 years ago

Hi everyone. Back after a week on another project. After a few tests, some odd bugs that we could not fix (callback functions and variables), and efficiency problems, we decided to fully rewrite the code.

We split a lot of big functions into smaller ones, moved some actions (removing the zip file) to other places, rewrote some if/else blocks with ternary operators for readability, etc.

As my coworker said :

we are going to write this as if a psychopath were going to read it

Please, be this psycho, and give us feedback :)

Lab9Pro-Service commented 5 years ago

Hi @DeschampsThomasSobook,

I've just started a test flow. Some notes:

I haven't checked the code itself since I'm not really familiar with the inner workings of the script. I'll leave that to @Dominick-Peluso-Bose