k-int / XCRI-Aggregator

XCRI Course Related Information - Feed Validator and Aggregator

invalid characters in the data #59

Open bukira opened 12 years ago

bukira commented 12 years ago

I have found invalid UTF-8 characters in the data being returned

I get it when doing an "art" keyword search and returning XML.

The error is in the course HNC Art and Design at Adam Smith College, in the prerequisites entry.

It says " 2 Highers including English and" then the error token.

ianibo commented 12 years ago

Heya.. I can't see any invalid chars: I get

2 Highers including English and Art and Design/Craft and Design and 3 Standard Grades at Grade 3 or aboveRelevant NQ courseApplicants will be asked to present evidence within a design portfolio illustrating their creative ability. This can include finished design solutions, photography, illustration, animation, 3D Models, and sketchbooks.

Can you paste in what you see, and let me know what your client is (Platform, toolset, etc)

bukira commented 12 years ago

I am trying on the iOS and Android platforms and both fail on this. Also, if I open the XML file in Safari on Mac I get the same error, and if you open the XML in Visual Studio on Windows it also gives a UTF-8 error and replaces the characters with a "diamond". If you open the XML in IE on Windows you get a different replacement character as well.

I can email you the XML file so you can view it; it was created via "save as source" in a browser.

MIJohnson commented 12 years ago

http://coursedata.k-int.com/discover/?adv=false&q=art&provider=748be706-036d-49a4-88e7-709cfc37f52a&qualification=*&studyMode=*&distance=25&dunit=miles&order=distance&location=&format=xml

When I use the above URL I'm not seeing any invalid characters for that course.

EDIT: whoops sorry had this message pending to go and got sidetracked onto something else.

bukira commented 12 years ago

this is what i see,

this is the first one (dagger between general and courses)

From the introductory and general†courses - NC and HN Textile UnitsFrom the advanced course - full-time study

the one i had before was this one

2 Highers including English and†Art†and Design/Craft†and Design and 3† Standard Grades at Grade 3 or aboveRelevant NQ courseApplicants will be asked to present evidence within a design portfolio illustrating their creative ability.† This can include finished design solutions, photography, illustration, animation, 3D Models, and sketchbooks.

Open the URL you have above in Safari, do "save as / page source", and then open the saved file in Safari. Also open it in Xcode and see the error.

MIJohnson commented 12 years ago

Yep, I can see there do seem to be some strange characters in places.

ianibo commented 12 years ago

Mark.. can you check in the source document and see what's there?

MIJohnson commented 12 years ago

Ian - I can't see any special characters in the source XML document (saved it locally and inspected it), which I guess might point to it being the Grails XML parser?

ianibo commented 12 years ago

Nope, they are there.. Just found them myself... possibly the result of a non-breaking space

MIJohnson commented 12 years ago

Ok I see them now if I use WordPad, my other Text editor must have been playing nice and hiding them from me.

ianibo commented 12 years ago

Problem seems to be around line 10606 in the original cap source file, particularly around

                    <description type="Entry Profile">
                            <div xmlns="http://www.w3.org/1999/xhtml" xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2005-06-28T14:08:18" xmlns:xd="http://schemas.microsoft.com/office/infopath/2003" xmlns:xp_0="http://www.w3.org/2001/XMLSchema-instance">
                                    <ul>
                                            <li>2 Highers including English and Art and Design/Craft and Design and 3  Standard Grades at Grade 3 or above</li>
                                            <li>Relevant NQ course</li>
                                    </ul>
                                    <p>Applicants will be asked to present evidence within a design portfolio illustrating their creative ability.  This can include finished design solutions, photography, illustration, animation, 3D Models, and sketchbooks.</p>
                            </div>
                    </description>

Strangely tho, xmllint does not choke on this even with the extra characters in there.

ianibo commented 12 years ago

More investigation seems to suggest it's in the stringification of the nested div within the description element.

It looks like the document might contain characters like soft hyphen, which are valid in UTF-8. It's not clear to me why these might get munged on the output side tho.. There also seem to be instances of U+00A0 (non-breaking space), which is also valid in UTF-8. We're investigating.

bukira commented 12 years ago

Cheers chaps, much appreciated

ianibo commented 12 years ago

00A0 is appearing only as A0 in the output. Looks like it could be an issue in the serialization of the xml document.
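To illustrate the observation above, here is a minimal Python sketch (not code from the aggregator). It assumes the stray output byte is a single 0xA0, as the comment suggests. In UTF-8, U+00A0 is the two-byte sequence C2 A0, so a lone A0 is what a Latin-1 encoding of the same character would produce, and a UTF-8 parser rejects it. The last line shows one possible reason a Mac client might display a dagger: in Mac OS Roman, byte 0xA0 is that character.

```python
# U+00A0 (non-breaking space) is two bytes in UTF-8 but one byte in Latin-1.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")      # C2 A0
latin1_bytes = nbsp.encode("latin-1")  # A0 only


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))

# The same lone byte, interpreted as Mac OS Roman, is U+2020 (dagger).
print(latin1_bytes.decode("mac_roman"))
```

If this is what is happening, the document is being written with a single-byte encoding somewhere while still being declared or served as UTF-8.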

bukira commented 12 years ago

Is this fixed? I don't seem to be getting any errors anymore

bukira commented 12 years ago

It's still happening

UCLan-MobileApps commented 12 years ago

Has this been fixed? It still appears to happen

rob-work commented 12 years ago

Hi Mark, can you let me know how this is progressing as the issue appears to be holding up progress on an Android app for the Elevator project. Cheers Rob

ianibo commented 12 years ago

All watching this issue, I'm not able to reproduce this. Can you please paste URLs of queries that show the problem to this issue. For example, I used

http://coursedata.k-int.com/discover/?adv=false&q=art&provider=748be706-036d-49a4-88e7-709cfc37f52a&qualification=*&studyMode=*&distance=25&dunit=miles&order=distance&location=&format=xml

and everything checks out.

Would also be most helpful if you could please add any error messages you see, and details about your environments.

ianibo commented 12 years ago

Just bumping this and including @bukira and @UCLan-MobileApps. Can you please add some URLs that show the problem and more data to the issue? Cheers.

UCLan-MobileApps commented 12 years ago

This only happens with Adam Smith College searches. If I do a search with "any" for provider (so provider=*) then it works fine, but if I specify Adam Smith College via its code:

http://coursedata.k-int.com/discover/?adv=true&q=a&provider=458ce390-b91c-4954-9896-b64e776506d5&qualification=*&studyMode=*&distance=25&dunit=miles&order=distance&location=&max=100&offset=0&format=xml

then I get the following error

org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 195633: not well-formed (invalid token)

I am using Android on Eclipse, but C# in Visual Studio also has the error. This only seems to happen for Adam Smith College data.
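A parser error like the one above gives only a line and column. A small script can report the exact byte that fails UTF-8 decoding, together with its surroundings. The following is a sketch only: `first_invalid_utf8` is a hypothetical helper, and the sample bytes are made up to resemble the prerequisites text, not taken from the real response.

```python
def first_invalid_utf8(data: bytes, context: int = 20):
    """Return (offset, bad_byte, surrounding_bytes) for the first byte that
    is not valid UTF-8, or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        start = max(0, err.start - context)
        return err.start, data[err.start], data[start:err.start + context]
    return None


# Made-up sample with a lone 0xA0 where a non-breaking space was intended.
sample = b"<li>2 Highers including English and\xa0Art</li>"

result = first_invalid_utf8(sample)
if result is not None:
    offset, bad, around = result
    print(f"offset {offset}: byte 0x{bad:02X} near {around!r}")
```

Running the same function over the bytes saved from the failing query (for example, the contents of a downloaded file read in binary mode) would show whether the offending token is indeed a lone 0xA0.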

scottbw commented 12 years ago

I've tried the same query and I get encoding issues; e.g. performing:

wget "http://coursedata.k-int.com/discover/?adv=true&q=a&provider=458ce390-b91c-4954-9896-b64e776506d5&qualification=*&studyMode=*&distance=25&dunit=miles&order=distance&location=&max=100&offset=0&format=xml" -O test.xml

... results in an XML document that can't be opened by most applications. Whereas:

wget "http://coursedata.k-int.com/discover/?adv=true&provider=e8f7f1d3-6f4e-46a7-868c-5f687ff8395a&q=medicine&format=xml" -O test.xml

results in something that opens OK. So the problem probably lies in an error in the text encoding from the Adam Smith feed not being fixed before export.
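One way such a feed could be cleaned up before export is sketched below. It assumes, without confirmation from this thread, that the invalid bytes are Latin-1 characters mixed into otherwise valid UTF-8. The handler name `latin1_fallback` and the function `repair_to_utf8` are hypothetical and are not part of the aggregator.

```python
import codecs


def _latin1_fallback(err):
    """Decode bytes that are not valid UTF-8 as Latin-1 instead of failing."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as clean UTF-8, mapping stray bytes via Latin-1."""
    text = raw.decode("utf-8", errors="latin1_fallback")
    return text.encode("utf-8")


# Made-up input: one lone 0xA0 and one correctly encoded U+00A0 (C2 A0).
broken = b"English and\xa0Art \xc2\xa0 ok"
fixed = repair_to_utf8(broken)
print(fixed)
```

Valid sequences pass through unchanged and only the undecodable bytes are reinterpreted, so this is safer than re-decoding the whole document as Latin-1. It is still a guess about the source encoding; fixing the serialization step that drops the leading byte would be the more reliable cure.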