Fix State transition err when hdfs ccm feature is used

TanYuxin-tyx commented 5 years ago

When we used dynamometer to test HDFS performance, the test encountered a error when generate DataNode Block info. The error stack is Error: java.io.IOException: State transition not allowed; from DEFAULT to FILE_WITH_REPLICATION at com.linkedin.dynamometer.blockgenerator.XMLParser.transitionTo(XMLParser.java:107) at com.linkedin.dynamometer.blockgenerator.XMLParser.parseLine(XMLParser.java:77) at com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:53) at com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:26) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:151) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:828) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

After checking Fsimage xml and the source code, we find that XMLParser can not parse the lines correctly, these lines are like `8963/user/somepath/path13cache_other_pool1544454142310false

8964/user/somepath/path23cache_hadoop-data_pool1544497817686false 8965/user/somepath/path33cache_hadoop-peisong_pool1544451500312false 8967/user/somepath/path43cache_other_pool1544497602570false` These fsimage xml lines are generated when [HDFS Centralized Cache Management (CCM)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html) feature is used. So we add a patch to ignore the CCM fsimage xml lines when parsing xml.

TanYuxin-tyx commented 5 years ago

@xkrogen hi Erik, could you please have a look?

xkrogen commented 5 years ago

Hi @yuxintan , thanks for reporting this. We don't use the caching feature so we haven't encountered this issue.

I wonder if there's a better way to handle this by adding support for the CCM feature to Dynamometer? I'm not sure how much work this would be, since I don't know much about how it works.

For this patch in particular, I'm fine with excluding the directives, but not too sure about the current logic. First off, is there really no </directive> closing the XML tag? If not, this is a bug with the XML generator -- the XML you've pasted is broken. I took a look at the code, and at least on branch-2.7, I see code that should be generating the closing tag.

Second, is it guaranteed that a line containing <directive> will never also contain a real file tag, and that the directive line won't be split across multiple lines? I want to ensure that we don't accidentally skip files, and that we don't accidentally only skip part of the directive. If possible, I would rather skip the entire <CacheManagerSection>...</CacheManagerSection> to avoid any such issues and make it very clear what is being skipped.

TanYuxin-tyx commented 5 years ago

@xkrogen Thanks very much for reviewing the patch.

CCM feature improve HDFS performance by storing some specific files' blocks in DataNode's memory, NameNode only need save these paths in its memory, there is little or no effect on NN. So we don't add support for the CCM feature to Dynamometer.

For the XML tag issue, it's the lack of clarity in my description led to confusion. There is </directive> closing the XML tag after <directive><id> line. The CCM XML lines are exactly like

<directive><id>8963</id><path>/user/somepath/path1</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544454142310</millis><relatilve>false</relatilve></expiration> 
</directive>
<directive><id>8964</id><path>/user/somepath/path2</path><replication>3</replication><pool>cache_hadoop-data_pool</pool><expiration><millis>1544497817686</millis><relatilve>false</relatilve></expiration> 
</directive>
<directive><id>8965</id><path>/user/somepath/path3</path><replication>3</replication><pool>cache_hadoop-peisong_pool</pool><expiration><millis>1544451500312</millis><relatilve>false</relatilve></expiration> 
</directive>
<directive><id>8967</id><path>/user/somepath/path4</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544497602570</millis><relatilve>false</relatilve></expiration>
</directive>

and the directive line won't be split across multiple lines, which are similar to other XML <inode> lines.

The patch only ignores these lines and bypass the problem, but I'm quite of your opinion that it's more elegant to avoid any such issues by skipping the entire <CacheManagerSection>...</CacheManagerSection>.

Thank you very much for your attention and kindly advice.

xkrogen commented 5 years ago

Got it, thanks for the detailed description.

I'm re-visiting this and feel like the right way to solve this is actually to add a new State, INODE_SECTION. When the <INodeSection> tag is seen, transition into INODE_SECTION, when </INodeSection> is seen, transition out. Then, the rest of parseLine() can simply ignore text whenever the state is DEFAULT. Then any <inode> tag outside of the INode section is completely ignored.

Let me know what you think.

xkrogen commented 5 years ago

Hey @yuxintan , I put up a PR with my described idea at PR #82, can you check and see if this solves your issue?

linkedin / dynamometer

Fix State transition err when hdfs ccm feature is used #77